Welcome to my Exploratory Data Analysis project for Udacity’s Data Analyst Nanodegree. As I wanted my work to be somewhat more original, I forewent the option of analysing one of the datasets in the Udacity list and went on to spend a few weeks trying to locate an appropriate dataset online; appropriate in this context meant firstly satisfying Udacity prerequisites and secondly being a real-world recent dataset, relatively “virgin” with respect to analysis attempts.
There are a few excellent directories of free datasets out there, as well as free APIs. The one that intrigued me the most though was the Iowa Liquor Sales dataset. This dataset contains transactions between the relevant Iowa state authority and liquor stores across the state with a very high level of detail. I was surprised to find out that in Iowa alcohol sales are state-controlled; it seems that all orders have to go through the state! Anyway, the reasons why I found this dataset interesting were the following:
The final point was quite important for me, because what mostly intrigues me in this field is discovering relationships and patterns which are not directly obvious (well I guess that applies for most of us here).
So I then spent a bit of time trying to think of alternative data that I could merge into the master (raw) dataset. Given the richly detailed data at zipcode level, I managed to pull from the internet the following zipcode level information:
I did not know beforehand whether this would definitely provide interesting insights with regards to liquor consumption, but at least I could supplement my project with some descriptives about Iowa (which I knew next to nothing about!). Additionally, it was greatly enjoyable to exercise the skills I learnt in the Nanodegree’s Data Wrangling course. I got to use a few API’s, scraped some tables, used some regex and more in order to get all my auxiliary data sets in a neat csv format. Here I cheated a bit and used Python. After this data was in csv format though I performed all subsequent data merging operations in R.
For the analysis purposes of this project I used RStudio on my Windows 7 laptop and also used git and GitHub for version control.
So, on to the data analysis.
# Load and pre-process liquor sales data ----------------------------------
#Using amazingly fast fread function to read in 800MB csv file
library(data.table)
data1 <- fread('./data/raw/Iowa_Liquor_Sales.csv', header = T, sep = ',',
verbose=TRUE)
#Convert dates to Date objects
data1$DATE <- as.Date(data1$DATE, "%m/%d/%Y")
#Fix ZIPCODE issues
data_zipcodes <- unique(data1$ZIPCODE)
#Rogue zipcodes are 97 (sioux city, Morningside avenue) and NA (Dunlap).
##Dunlap is 61525 and Sioux city is 51106 (googled it)
summary(data_zipcodes)
min(na.omit(data_zipcodes))
max(na.omit(data_zipcodes))
sum(is.na(data1$ZIPCODE))
data1$CITY[is.na(data1$ZIPCODE)]
data1$ZIPCODE[is.na(data1$ZIPCODE)] <- 61525
data1$CITY[data1$ZIPCODE==97]
data1$ZIPCODE[data1$ZIPCODE==97] <- 51106
#Convert ZIPCODE to unordered factor (nominal variable)
data1$ZIPCODE <- factor(data1$ZIPCODE, ordered=F)
#Omit dollars in cost and price variables and convert to numeric;
#could have used gsub
data1$`STATE BTL COST` <- sapply(strsplit(data1$`STATE BTL COST`,
split='$', fixed=TRUE),
function(x) (x[2]))
data1$`STATE BTL COST` <- as.numeric(data1$`STATE BTL COST`)
data1$`BTL PRICE` <- sapply(strsplit(data1$`BTL PRICE`, split='$',
fixed=TRUE), function(x) (x[2]))
data1$`BTL PRICE` <- as.numeric(data1$`BTL PRICE`)
data1$`TOTAL` <- sapply(strsplit(data1$`TOTAL`, split='$', fixed=TRUE),
function(x) (x[2]))
data1$`TOTAL` <- as.numeric(data1$`TOTAL`)
#The STORE LOCATION variable holds latitude and longitude values
#These will be useful in plotting on maps so let's extract them.
data1$STORE_LAT <- sapply(data1$`STORE LOCATION`, extract_lat_lon)[1,]
data1$STORE_LON <- sapply(data1$`STORE LOCATION`, extract_lat_lon)[2,]
#Dropping STORE LOCATION as it is redundant
select(data1, -`STORE LOCATION`)
#2015 data are not for the full year so I will just keep 2014
data1 <- subset(data1, DATE<"2015-01-01")
# Load auxilliary data in order to merge with main ---------
# Establishment data set --------------------------------------------------
#Load data
# I found this data set on https://www.census.gov/econ/geo-zip.html (i think)
# or here http://factfinder.census.gov/
# faces/affhelp/jsf/pages/metadata.xhtml?lang=en&type=dataset&id=dataset.en.BP_2011
est_data <- fread('./data/raw/iowa_food_drink.csv', header = T, sep = ',', verbose=TRUE)
#Pre-process data
library(tidyr)
library(dplyr)
#Drop unnecessary columns. Converted to data frame.
est_data <- select(as.data.frame(est_data), -GEOGRAPHY, -NAICS2007)
#Remove duplicates
est_data <- est_data[!duplicated(est_data2),]
#Transform dataset
est_data <- spread(est_data, NAICS2007_MEANING, ESTAB)
#Set missing to zero
est_data[is.na(est_data)] <- 0
#Make it a data table again
est_data <- as.data.table(est_data)
# Population data set -----------------------------------------------------
# Load data
# I don't remember where I got this data from
pop_data <- fread('./data/raw/Iowa_population.csv', header = T, sep = ',', verbose=TRUE)
# Weather data set --------------------------------------------------------
# Ordered daily data for 2014 from
#http://www.ncdc.noaa.gov/cdo-web/datatools/records
# Unfortunately data was very incomplete.
## Load data
weather_data <- fread('./data/raw/Iowa_weather.csv', header = T,
sep = ',', verbose=TRUE)
## Pre-condition data
weather_data <- select(weather_data, -STATION, -TSUN, -AWND, -WT09, -WT01,
-WT06, -WT05, -WT02, -WT11, -WT04, -WT08, -WT03)
weather_data$ELEVATION <- as.numeric(weather_data$ELEVATION)
weather_data$LATITUDE <- as.numeric(weather_data$LATITUDE)
weather_data$LONGITUDE <- as.numeric(weather_data$LONGITUDE)
weather_data$DATE <- as.Date(as.character(weather_data$DATE), "%Y%m%d")
#Celsius convert. range of T is -200 to 200; something wrong
#weather_data$TMIN <- (weather_data$TMIN-32)*5/9
# Measurements do not include zipcode, so we need to associate zipcode to
#latitude and longitude. For this purpose, I pulled the relevant data from
#http://www.geonames.org/export/web-services.html#findNearbyPostalCodes
#by using their really nice API in Python (query input lat/lon and output
#is zipcode. I then made a .csv)
# Insert associated zipcode - lat/lon data
zip_lat_lon_data <- fread('./data/processed/ZIPCODES_LATLON.csv',
header = T, sep = ',', verbose=TRUE)
# Merge data sets
weather_data <- left_join(weather_data, zip_lat_lon_data,
by=c("LATITUDE", "LONGITUDE"))
weather_data <- select(weather_data, -V1)
#NAs in ZIPCODE. Possibly latitude and longitude not exactly same.
#Trying to populate based on min distances.
#Get NA subset to work with
nas.lon <- weather_data$LONGITUDE[is.na(weather_data$ZIPCODE)]
nas.lat <- weather_data$LATITUDE[is.na(weather_data$ZIPCODE)]
nas <- data.frame(nas.lon, nas.lat)
nas <- nas[!duplicated(nas),]
nas <- na.omit(nas)
ref.LLZ <- as.data.frame(select(zip_lat_lon_data, LONGITUDE,
LATITUDE, ZIPCODE))
library("geosphere")
# Code to add zipcode in intermediate data.frame that had NA's.
# To be merged to main data1.
for (i.na in seq(1,nrow(nas))) {
b <- c()
for (j.ref in seq(1,nrow(ref.LLZ))) {
a <- distGeo(nas[i.na,], ref.LLZ[j.ref, 1:2])
b <- append(a, b)
}
nas[i.na,3] <- ref.LLZ[which.min(b),3]
}
colnames(nas)[1:3] <- c("LONGITUDE", "LATITUDE", "ZIPCODE")
# Apply missing values fix
# Missing values condition
cond <- (is.na(weather_data$ZIPCODE))&(!is.na(weather_data$LATITUDE))
# Split data set into "good" and "bad"
weather_data.nas <- select(weather_data[cond], -ZIPCODE)
weather_data.nas <- left_join(weather_data.nas, nas,
by=c("LATITUDE", "LONGITUDE"),copy=TRUE)
weather_data.good <- weather_data[!cond]
# Rejoin them after fixing "bad" subset
weather_data <- rbind(weather_data.nas, weather_data.good)
# Do some renaming so we don't have any conlicts with main during merge
setnames(weather_data, "LONGITUDE", "W_LONGITUDE")
setnames(weather_data, "LATITUDE", "W_LATITUDE")
setnames(weather_data, "STATION_NAME", "W_STATION_NAME")
# Decided to keep only these columns; rain, snow, min/max temp
weather_data <- select(weather_data,
c(DATE, ZIPCODE, PRCP, SNOW, TMIN, TMAX))
#Convert temps as there is something wrong! (to Celsius)
#weather_data$TMIN <- (weather_data$TMIN/10 - 32)*0.5556
#weather_data$TMAX <- (weather_data$TMAX/10 - 32)*0.5556
data1 <- fread('./data/processed/IA_unis.csv',
header = T, sep = ',', verbose=TRUE)
# Universities data set ---------------------------------------------------
unis <- fread('./data/processed/IA_unis.csv', header = T,
sep = ',', verbose=TRUE)
#No need to merge with master; just for plotting
###Here the main and auxilliary data sets are merged
# 1. Merge main and auxilliary establishment data set ---------------------
data1 <- left_join(data1, est_data, by = "ZIPCODE")
# NA's present because zipcode was not in est_data3
data1 <- na.omit(data1)
#Calculate total establishments
data1$TOTAL_EST <- rowSums(data1[seq(0,nrow(data1)), 23:45,with=FALSE])
# 2. Merge main and auxilliary population data set ------------------------
# Merge population data with base
data1 <- left_join(data1, pop_data, by = "ZIPCODE")
# Write unique ZIPCODE for Python use
#write.csv2(unique(data1$ZIPCODE), file="zipcodes.csv", sep=",")
# 3. Merge main and weather data set --------------------------------------
# Firstly identify common zipcodes in two data sets
inters_zipcodes <- intersect(unique(data1$ZIPCODE),
unique(weather_data5$ZIPCODE))
# Split main data set based on zipcode intersection
data1.paired <- data1[which(data1$ZIPCODE==inters_zipcodes)]
data1.unpaired <- data1[!which(data1$ZIPCODE==inters_zipcodes)]
# Initialise columns for rbind
data1.unpaired$PRCP <- NA
data1.unpaired$SNOW <- NA
data1.unpaired$TMIN <- NA
data1.unpaired$TMAX <- NA
# Remove duplicate zipcode-date entries. These are valid as they are from
# different stations
weather_data <- weather_data[!duplicated(weather_data,
by=c("DATE","ZIPCODE"))]
data1.paired <- left_join(data1.paired, weather_data,
by=c("DATE", "ZIPCODE"))
#Re-join unpaired with enriched paired
data1 <- rbind(data1.paired, data1.unpaired)
# 4. Merge main and demographics data sets --------------------------------
# I made these data sets by scraping zipatlas tables in Python
# example URL:
# http://zipatlas.com/us/ia/zip-code-comparison/
# percentage-white-population.htm
# Nationalities data sets name parts
nationalities <- c('arab', 'asian','black','chinese','czech','danish',
'dutch','english','filipino', 'french','german','greek',
'hispanic','indian','irish','italian','japanese','korean',
'lithuanian', 'mexican','native','norwegian','polish',
'portuguese','russian','scottish','slovak','swedish',
'swiss', 'welsh','west','white')
# Which ones to keep in the end
sub_nationalities <- nationalities[c(1:4, 7,8, 10:17, 20,21,23, 25,26,32)]
# So, because of data set size, I decided to omit individual establishments
#data and only keep totals...
data1 <- data1[,c(1:22, 46:51)]
# Run left_join.nationalities program that does all pre-processing and
#joining to main.
for (nation in sub_nationalities){
left_join.nationalities(nation, data1)
}
# Variables in master data set are listed below:
colnames(data1)
## [1] "ZIPCODE" "DATE" "STORE"
## [4] "NAME" "ADDRESS" "CITY"
## [7] "STORE LOCATION" "COUNTY NUMBER" "CATEGORY"
## [10] "CATEGORY NAME" "VENDOR NO" "VENDOR"
## [13] "DESCRIPTION" "PACK" "LITER SIZE"
## [16] "STATE BTL COST" "BTL PRICE" "BOTTLE QTY"
## [19] "TOTAL" "TOTAL_EST" "POPULATION"
## [22] "PRCP" "SNOW" "TMIN"
## [25] "TMAX" "Perc_arab" "Perc_asian"
## [28] "Perc_black" "Perc_chinese" "Perc_dutch"
## [31] "Perc_english" "Perc_french" "Perc_german"
## [34] "Perc_greek" "Perc_hispanic" "Perc_indian"
## [37] "Perc_irish" "Perc_italian" "Perc_japanese"
## [40] "Perc_mexican" "Perc_native" "Perc_polish"
## [43] "Perc_russian" "Perc_scottish" "Perc_white"
## [46] "BOTTLE_QTY_NORM" "TOTAL_EST_NORM" "STORE_LAT"
## [49] "STORE_LON"
Columns from the original liquor sales data are 1-18; others come from the auxilliary sources.
# Example row in dataset
data1[1,]
## ZIPCODE DATE STORE NAME ADDRESS CITY
## 1: 50002 2014-12-08 4417 KUM & GO #76 / ADAIR 109 S 5TH ST ADAIR
## STORE LOCATION
## 1: 109 S 5TH ST\nADAIR 50002\n(41.49807185700007, -94.643468852999945)
## COUNTY NUMBER CATEGORY CATEGORY NAME VENDOR NO
## 1: 1 1012100 CANADIAN WHISKIES 115
## VENDOR DESCRIPTION PACK LITER SIZE
## 1: Constellation Wine Company, Inc. Black Velvet 12 750
## STATE BTL COST BTL PRICE BOTTLE QTY TOTAL TOTAL_EST POPULATION PRCP
## 1: 5.23 7.85 12 94.2 22 1297 NA
## SNOW TMIN TMAX Perc_arab Perc_asian Perc_black Perc_chinese Perc_dutch
## 1: NA NA NA 0 0.37 0 0 1.46
## Perc_english Perc_french Perc_german Perc_greek Perc_hispanic
## 1: 11.05 1.38 41.13 0 1.04
## Perc_indian Perc_irish Perc_italian Perc_japanese Perc_mexican
## 1: 0.22 9.75 0.16 0.07 0.89
## Perc_native Perc_polish Perc_russian Perc_scottish Perc_white
## 1: 0.07 0.4 0 1.05 98.8
## BOTTLE_QTY_NORM TOTAL_EST_NORM STORE_LAT STORE_LON
## 1: 0.00925212 0.01696222 41.49807 -94.64347
The dataset is quite rich and holds many features of potential interest. For the purposes of this project I opted to focus on the drinking patterns of Iowans with respect to time of year and locality.
As an initial attempt at producing descriptives for the dataset, I am defining below three functions that generate histograms and calculate means and number of observations. I have split the variables into three sets; the original liquor sales variables, the added meteorological variables and the added demographic variables. For the latter two categories, the statistics were obtained on data grouped by zipcode in order to be more correct.
# plot distributions instead of showing numbers
# Function for original liquor sales variables
# y is column number of master df, x is binwidth for hist
quick_hist <- function(y, x = 0) {
npl <- y-13
test =data1[, y, with=F]
test <- na.omit(test)
x = ifelse(x==0, (max(test)-min(test))/30, x)
nameold <- colnames(test)[1]
namevar <- gsub(" ", "_", nameold)
setnames(test, nameold, namevar)
meanc = as.numeric(colMeans(test[,1,with=F], na.rm=T))
plt <- ggplot(data = test, aes_string(x=colnames(test)[1])) +
geom_histogram(binwidth = x) +
geom_vline(data=test,aes_string(xintercept=as.numeric(meanc)),
linetype="dashed", colour="red", size=1) +
ggtitle(paste0(as.character(npl), ". ",
"# Obs: ", as.character(nrow(na.omit(test))),", ",
"Mean: ", as.character(round(meanc, digits=2)))) +
theme(plot.title=element_text(face="bold", size = 8))
rm(test)
return(plt)
}
# For meteorological variables - group by DATE, ZIPCODE
quick_hist_meteo <- function(y, x = 0) {
npl <- y-13
test=data1[,c(1,2,y), with=F]
test <- na.omit(test)
x = ifelse(x==0, (max(test[,3,with=F])-min(test[,3,with=F]))/30, x)
nameold <- colnames(test)[3]
namevar <- gsub(" ", "_", colnames(test)[3])
setnames(test, nameold, "temp")
test <- ddply(test, .(DATE, ZIPCODE), summarise,
temp=mean(temp, na.rm=T))
setnames(test, "temp", namevar)
meanc = mean(test[,3], na.rm=T)
plt <- ggplot(data = test, aes_string(x=colnames(test)[3])) +
geom_histogram(binwidth = x) +
geom_vline(data=test,aes_string(xintercept=as.numeric(meanc)),
linetype="dashed", colour="red", size=1) +
ggtitle(paste0(as.character(npl), ". ",
"# Obs: ", as.character(nrow(na.omit(test))),", ",
"Mean: ", as.character(round(meanc, digits=2)))) +
theme(plot.title=element_text(face="bold", size = 8))
rm(test)
return(plt)
}
# For demographic variables - group by DATE, ZIPCODE
quick_hist_demos <- function(y, x = 0) {
npl <- y-13
test <- data1[,c(1,y), with=F]
test <- data.table(na.omit(test))
x = ifelse(x==0, (max(test[,2,with=F])-min(test[,2,with=F]))/30, x)
nameold <- colnames(test)[2]
namevar <- gsub(" ", "_", colnames(test)[2])
setnames(test, nameold, "temp")
test <- ddply(test, .(ZIPCODE), summarise,
temp=mean(temp, na.rm=T))
colnames(test)[2] <- namevar
test <- data.table(test)
meanc = mean(test[,2], na.rm=T)
plt <- ggplot(data = test, aes_string(x=colnames(test)[2])) +
geom_histogram(binwidth = x) +
geom_vline(data=test,aes_string(xintercept=as.numeric(meanc)),
linetype="dashed", colour="red", size=1) +
ggtitle(paste0(as.character(npl), ". ",
"# Obs: ", as.character(nrow(na.omit(test))),", ",
"Mean: ", as.character(round(meanc, digits=2)))) +
theme(plot.title=element_text(face="bold", size = 8))
rm(test)
rm(meanc)
return(plt)
}
For the following plots I used multiplot; a very useful function from cookbook-r.
# Multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols: Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
library(grid)
# Make a list from the ... arguments and plotlist
plots <- c(list(...), plotlist)
numPlots = length(plots)
# If layout is NULL, then use 'cols' to determine layout
if (is.null(layout)) {
# Make the panel
# ncol: Number of columns of plots
# nrow: Number of rows needed, calculated from # of cols
layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
ncol = cols, nrow = ceiling(numPlots/cols))
}
if (numPlots==1) {
print(plots[[1]])
} else {
# Set up the page
grid.newpage()
pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
# Make each plot, in the correct location
for (i in 1:numPlots) {
# Get the i,j matrix positions of the regions that contain this subplot
matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
layout.pos.col = matchidx$col))
}
}
}
The liquor sales variables in this case are number of bottles per ordered pack, volume of bottle in millilitres, the cost of the bottle for the state, the price it was sold to the merchant for, the total number of bottles ordered and the total money spent per order for the merchant. Let’s look at their distributions.
# Plot liquor sales stuff
multiplot(quick_hist(14) + theme(plot.title = element_text(size=6),
axis.title = element_text(size=6)),
quick_hist(15) + theme(plot.title = element_text(size=6),
axis.title = element_text(size=6)),
quick_hist(16) + theme(plot.title = element_text(size=6),
axis.title = element_text(size=6)),
quick_hist(17) + theme(plot.title = element_text(size=6),
axis.title = element_text(size=6)),
quick_hist(18) + theme(plot.title = element_text(size=6),
axis.title = element_text(size=6)),
quick_hist(19) + theme(plot.title = element_text(size=6),
axis.title = element_text(size=6)),
cols=2)
It is not really clear what’s going on so I am going to log-transform the x-axis of the cost-related variables and zoom-in closer for “PACK”, “LITER SIZE” and “BOTTLE QTY”.
# Adjust them so we see more things
multiplot(quick_hist(14, 2) + coord_cartesian(xlim=c(0, 50)),
quick_hist(15, 50) + coord_cartesian(xlim=c(0, 2000)),
quick_hist(16,0.1) + scale_x_log10(), quick_hist(17, 0.1) + scale_x_log10(),
quick_hist(18,2) + coord_cartesian(c(0,100)),
quick_hist(19, 0.1) + scale_x_log10(), cols=3)
So now we can see that distributions of the cost-related have transformed to bell-shaped or normal-like. We can also clearly see the discrete nature of the other variables.
Now, let’s see descriptives of the meteorological variables; again these were derived after grouping by zipcode-day pairs.
multiplot(quick_hist_meteo(22),
quick_hist_meteo(23),
quick_hist_meteo(24),
quick_hist_meteo(25),
cols = 2)
For precipitation and snow it’s hard to discern what’s going on so I will log-transform x-axes; for the temperatures I will just make the bins finer. Here are the plots:
multiplot(quick_hist_meteo(22, 0.1) + scale_x_log10(),
quick_hist_meteo(23, 0.1) + scale_x_log10(),
quick_hist_meteo(24, 2),
quick_hist_meteo(25, 2),
cols = 2)
I would like to note that in general I am a bit disappointed by the completeness of the meteorological data that I ordered from the National Climatic Data Center; I also don’t understand the units used for temperatures, as any transformations I tried did not return reasonable planet Earth values..
The above plots are a bit hard to interpret as they are, because zipcodes are not represented uniformly; e.g. zipcode x may have meteorological readings for all days whereas others just for 5 days. If one desires to perform some inference on Iowa weather, the simplest way would be to generate the same plots for individual zipcodes.
Now onto the final partition of variables, the demographics. These include ethnic groups/nationalities percentages and total drinking establishments - grouped by zipcode.
# Plot demographic variables
multiplot(quick_hist_demos(20),
quick_hist_demos(26), quick_hist_demos(27), quick_hist_demos(28),
quick_hist_demos(29), quick_hist_demos(30), quick_hist_demos(31),
quick_hist_demos(32), quick_hist_demos(33), quick_hist_demos(34),
quick_hist_demos(35), quick_hist_demos(36), quick_hist_demos(39),
quick_hist_demos(40), quick_hist_demos(41), quick_hist_demos(42),
quick_hist_demos(43), quick_hist_demos(44), quick_hist_demos(45),
cols = 5)
From plots 14, 15 and 32 I guess it is clear that Iowa’s population is predominantly white. Without knowing too much about US geodemographics, presumably this is a “heartland” characteristic?
Otherwise I am surprised to see the magnitude of the German population - almost 50% on average! It can’t be true that half the Iowa population is German people, so I guess these categories pertain to “heritage” in general. Which means almost half of the Iowans are of German descent.
Also common is English, Scottish, French and Dutch heritage. In these histograms, one can see that the black and Hispanic and white group distributions exhibit long-tails, implying that there exist specific areas where these populations are much more (or less) prevalent than generally encountered in Iowa.
The “outlier” plot category in this panel is the total number of drinking establishments per zipcode - included here for thoroughness purposes.
In addition to the above, I think it would be useful to show the most popular categories of liquors and individual types.
The dataset contains 69 different categories, so it will be hard to visualise all of them in a plot. In this case just showing a table is also informative.
gr_cat <- ddply(data1, .(`CATEGORY NAME`), summarise,
bottles=sum(`BOTTLE QTY`, na.rm=T), price=mean(`BTL PRICE`))
arrange(gr_cat, -bottles)
## CATEGORY NAME bottles price
## 1 80 PROOF VODKA 5034321 9.175411
## 2 CANADIAN WHISKIES 3063617 13.687787
## 3 SPICED RUM 1781396 14.380130
## 4 WHISKEY LIQUEUR 1092220 16.440005
## 5 TEQUILA 1080224 20.358524
## 6 BLENDED WHISKIES 1069756 10.400273
## 7 STRAIGHT BOURBON WHISKIES 1067546 17.220917
## 8 IMPORTED VODKA 1008599 20.268337
## 9 PUERTO RICO & VIRGIN ISLANDS RUM 992118 11.405754
## 10 FLAVORED VODKA 966793 10.935803
## 11 AMERICAN GRAPE BRANDIES 689098 8.059325
## 12 TENNESSEE WHISKIES 688586 20.747060
## 13 AMERICAN DRY GINS 613832 9.085220
## 14 FLAVORED RUM 538311 12.678766
## 15 AMERICAN COCKTAILS 534658 10.689236
## 16 MISC. IMPORTED CORDIALS & LIQUEURS 487998 21.449140
## 17 IMPORTED VODKA - MISC 474696 16.909476
## 18 CREAM LIQUEURS 431367 17.743147
## 19 IMPORTED GRAPE BRANDIES 372098 21.058585
## 20 IMPORTED SCHNAPPS 349139 16.437759
## 21 SCOTCH WHISKIES 301218 23.689421
## 22 MISC. AMERICAN CORDIALS & LIQUEURS 253863 11.564033
## 23 DISTILLED SPIRITS SPECIALTY 225048 15.317123
## 24 TRIPLE SEC 224300 3.974883
## 25 DECANTERS & SPECIALTY PACKAGES 216893 24.330833
## 26 PEPPERMINT SCHNAPPS 213255 7.174323
## 27 IMPORTED DRY GINS 212966 22.151287
## 28 IRISH WHISKIES 208518 25.281346
## 29 PEACH SCHNAPPS 151208 9.646169
## 30 COFFEE LIQUEURS 136482 15.525798
## 31 STRAIGHT RYE WHISKIES 126953 26.339466
## 32 BLACKBERRY BRANDIES 121103 9.163894
## 33 AMERICAN AMARETTO 119461 6.739355
## 34 SINGLE MALT SCOTCH 87382 43.527749
## 35 APPLE SCHNAPPS 65761 10.148421
## 36 AMERICAN ALCOHOL 57573 13.408440
## 37 BUTTERSCOTCH SCHNAPPS 57166 9.235604
## 38 CINNAMON SCHNAPPS 48640 11.526123
## 39 APRICOT BRANDIES 47441 9.035857
## 40 WATERMELON SCHNAPPS 39578 10.514740
## 41 MISCELLANEOUS SCHNAPPS 36833 12.033186
## 42 GRAPE SCHNAPPS 34779 10.654928
## 43 IMPORTED AMARETTO 25090 20.526154
## 44 BARBADOS RUM 24914 15.059416
## 45 ROOT BEER SCHNAPPS 24296 9.322246
## 46 PEACH BRANDIES 23588 7.696805
## 47 CHERRY BRANDIES 21856 7.767895
## 48 100 PROOF VODKA 19816 14.089263
## 49 JAMAICA RUM 19799 16.038717
## 50 STRAWBERRY SCHNAPPS 19578 7.720291
## 51 RASPBERRY SCHNAPPS 17918 8.544250
## 52 FLAVORED GIN 17434 10.538608
## 53 TROPICAL FRUIT SCHNAPPS 14127 7.256225
## 54 GREEN CREME DE MENTHE 11782 6.934776
## 55 SINGLE BARREL BOURBON WHISKIES 9574 34.392213
## 56 AMERICAN SLOE GINS 8731 6.828953
## 57 WHITE CREME DE CACAO 8543 6.967944
## 58 DARK CREME DE CACAO 8263 6.967815
## 59 LOW PROOF VODKA 5690 12.502111
## 60 MISCELLANEOUS BRANDIES 4580 24.599805
## 61 SPEARMINT SCHNAPPS 4313 7.203551
## 62 OTHER PROOF VODKA 3876 11.962781
## 63 ROCK & RYE 3611 10.966809
## 64 BOTTLED IN BOND BOURBON 3070 14.623684
## 65 WHITE CREME DE MENTHE 2839 6.999466
## 66 1755 33.695673
## 67 CREME DE ALMOND 1710 7.001005
## 68 ANISETTE 1491 6.927277
## 69 HIGH PROOF BEER 38 109.174286
We see that 80 proof vodka is the most popular liquor type in Iowa by some margin. Canadian whiskies are also very popular. Also popular are rums, cheap whiskies and tequila. There also exists a category without a name which has a relatively high average price.
On the contrary, the least popular liquors are high proof beer (only 38 bottles sold - a categorisation outlier? A one-off order?) and liqueurs like Anisette (I imagine this is like ouzo), crème liqueurs, and spearmint schnapps (peppermint schnapps on the other hand is quite popular - both must taste like cough medicine!).
Let’s see the most and least popular individual liquors.
gr_item <- ddply(data1, .(DESCRIPTION), summarise,
bottles=sum(`BOTTLE QTY`, na.rm=T), price=mean(`BTL PRICE`))
#Most popular
arrange(gr_item, -bottles)[1:30,]
## DESCRIPTION bottles price
## 1 Black Velvet 1422405 10.191468
## 2 Hawkeye Vodka 1036303 7.313998
## 3 Captain Morgan Spiced Rum 656577 17.603547
## 4 Fireball Cinnamon Whiskey 572895 15.620919
## 5 Jack Daniels Old #7 Black Lbl 473872 25.320187
## 6 Five O'clock 466019 7.092719
## 7 Barton Vodka 398614 8.211223
## 8 Mccormick Vodka Pet 371391 6.719349
## 9 Smirnoff Vodka 80 Prf 365132 11.796400
## 10 Absolut Swedish Vodka 80 Prf 320092 19.418173
## 11 Crown Royal Canadian Whisky 318719 25.130260
## 12 Phillips Vodka 307152 9.240963
## 13 Bacardi Superior Rum 301426 13.453927
## 14 Seagrams 7 Crown Bl Whiskey 300828 12.463819
## 15 Jagermeister Liqueur 287670 17.999875
## 16 Mccormick Vodka 278271 3.637401
## 17 Paul Masson Grande Amber Brandy 275110 5.361728
## 18 Jim Beam 269741 16.211123
## 19 Admiral Nelson Spiced Rum 264521 12.053835
## 20 Five Star 260876 7.090375
## 21 Jose Cuervo Especial Reposado Tequila 242899 16.597364
## 22 Paramount White Rum 239648 9.024818
## 23 Canadian Ltd Whisky 235012 9.412726
## 24 Southern Comfort 229259 12.327208
## 25 Seagram's Extra Dry Gin 223595 8.402159
## 26 Captain Morgan Original Spiced 223252 13.245653
## 27 Five O'clock Vodka 220975 7.155968
## 28 Grey Goose Vodka 219106 24.536126
## 29 Malibu Coconut Rum 215779 14.376951
## 30 Uv Blue (raspberry) Vodka 194191 11.437719
Surprisingly I’ve never noticed most of these spirits at bars in Europe. Altogether though this list was what I expected: vodkas, whiskeys and rums and there are a few global household names in there such as Captain Morgan, Jack Daniels, Smirnoff, Jagermeister, etc.. One thing is for sure though.. Iowans love vodka and particularly the cheaper local variety! I’m also a bit curious regarding the popularity of cinnamon whiskey. Maybe a success with the student population?
#Least popular
arrange(gr_item, bottles)[1:30,]
## DESCRIPTION bottles price
## 1 Calvados Morin XO 1 70.50
## 2 Framboise (Raspberry Liqueur) 1 45.89
## 3 Glendronach 18yo 1 109.88
## 4 John J Bowman Str. Bbn.Why. HA 1 23.40
## 5 Calvados Morin Hors d' Age 2 54.00
## 6 Cedar Ridge Barrel Proof Bourbon 2 8700.00
## 7 Espolon Blanco 2 20.13
## 8 Fernandaez Brandy De Jerez 2 54.96
## 9 Glendronach 14yr Virgin Oak - DNO 2 69.30
## 10 Grappa di Recioto Amarone 2 117.36
## 11 Hirsch Straight Bourbon Small Batch 2 33.00
## 12 Jameson 12YR HA 2 44.99
## 13 Johnnie Walker Odyssey 2 747.96
## 14 Macallan Sherry Cask 25 Year Old Scotch - HA 2 845.10
## 15 New Amsterdam Citron Mini 2 6.93
## 16 Sam Adams Utopias 2 118.96
## 17 Seagram's Seven Crown American Spiced Mini 2 9.26
## 18 Taaka King Cake Vodka 2 6.00
## 19 Tooters 50ml Marg, OTB, Mango 2 65.25
## 20 Toucano Rum 2 26.04
## 21 Uncle Bob's Root Beer 2 15.38
## 22 1800 Ultimate Blueberry RTD 3 14.25
## 23 Glenmorangie 25yr 3 431.99
## 24 Hiram Walker Blue Curacao 3pak 3 5.85
## 25 Johnnie Walker King George V HA 3 444.98
## 26 Libertine Absinthe(french Absinthe) 3 64.13
## 27 Seagram's Seven Crown American Spiced 3 10.50
## 28 Admiral Nelson Cherry Spiced Rum Mini 4 5.66
## 29 Ciroc Luxury Mini 4 44.93
## 30 Irishman Founders Reserve 4 25.01
All items in the top 30 least popular list are in a sense outliers as at most 4 bottles have been ordered altogether. These probably correspond to single orders, and it is likely that some kind of data recording inconsistency has taken place (e.g. product description entered somehow differently).
Interestingly however, a few rather expensive liquors are included in this list - mainly bourbons and whiskeys. For example, two orders were placed for “Cedar Ridge Barrel Proof Bourbon” - with a bottle price of 8700 (!). It turns out that two separate orders were placed on 27/10 and 29/10 by two separate “Hy-Vee Food” stores in different zipcodes in Cedar Rapids. Interesting - anyone’s guess really as to the background story. It has to be a special order for a specific customer (maybe they broke the first bottle? An expensive mistake!).
In this part I intended to show some exploratory scatter plots between variables in order to visually identify any stand-out associations, by making use of the great ggpairs function of GGally. There are a couple of issues though. For one, the number of observations is quite large (~2.5m) and also, the dataset contains 49 variables. There is no way ggpairs will produce a sensible display so we need to be a bit clever and restrict what we feed it.
As a first attempt, let’s look at associations between proportions of national heritage group proportions.
This one is fairly easy to display, as observations are unique per zipcode (thus restricting the number of observations to about 350 - the number of unique zipcodes) and I have included 20 relevant variables in the data. Let’s see what we get.
gr_zip_pop <- ddply(data1[,c(1,24, seq(29,68,2), 72, 73), with=F], .(ZIPCODE),
numcolwise(mean, na.rm=T))
gr_zip_pop$ZIPCODE <- factor(gr_zip$ZIPCODE, ordered=F)
ggpairs(gr_zip_pop, columns=3:22, axisLabels = 'internal',
upper = "blank", lower=list(params=list(size=1)),
params = c(Shape = I("."), outlier.shape = I(".")),
title="Scatter plots between national heritage group proportions") +
theme(legend.position = "none",
panel.grid.major = element_blank(),
axis.ticks = element_blank(),
axis.text = element_blank())
The display is a bit compressed but we can still observe positive linear correlations between hispanic and mexican, english and scottish, english and irish, french and irish. We can also spot a few negative correlations such as german and english, german and french, black and white, asian and white. I think these displays aren’t very informative.
Next, let’s look at scatter plots between the main variables of interest in the original liquor sales data. I will perform the analysis on a reasonably sized subset of the data by restricting to a single zipcode (50314 - the highest populated zipcode in Des Moines).
pairs2 <- data1[data1$ZIPCODE==50314][,c(14, 15,16,17,18,19),with=F]
#ggpairs does not like spaces in variable names
setnames(pairs2, "STATE BTL COST", "STATE_BTL_COST")
setnames(pairs2, "LITER SIZE", "LITER_SIZE")
setnames(pairs2, "BTL PRICE", "BTL_PRICE")
setnames(pairs2, "BOTTLE QTY", "BOTTLE_QTY")
ggpairs(pairs2, axisLabels = 'internal',
upper = "blank", lower=list(params=list(size=1)),
params = c(Shape = I("."), outlier.shape = I("."))) +
theme(legend.position = "none",
panel.grid.major = element_blank(),
axis.ticks = element_blank(),
axis.text = element_blank())
So the raw output here is not very informative and in order to see more one has to adjust axes, convert scales, etc. This will take place for variables of interest later on in this work. But a quick scan through the plots shows a very strong linear correlation between state buying and selling price of items; this implies a fixed profit (in terms of percentage) is set by the state. Also correlated, as expected, are bottle quantities sold per transaction and total money spent per transaction.
I spent some time trying some variable groupings with ddply in order to find some interesting combinations to input to ggpairs but was unable to produce any particularly interesting visualisations at this preliminary exploratory stage. I think that due to plot size restrictions, ggpairs is not the best tool to use for a dataset of this magnitude in terms of number of observations and variables. I also found it was quite hard to find a way to adjust font sizes in the various sections of the output - especially on the diagonal.
In the following plots we will perform a high-level analysis of basic demographics.
Firstly let’s look at population per zipcode, so we need to group accordingly (we’ll use the gr_zip_pop variable we defined earlier on).
#Prep for map plotting
#Get map of Iowa using RgoogleMaps and ggmap
CenterOfMap <- geocode("41.9137948,-93.3293731")
Iowa <- get_map(c(lon=CenterOfMap$lon, lat=CenterOfMap$lat), zoom = 7,
source = "stamen", maptype= "toner", color="bw")
IowaMap <- ggmap(Iowa)
So now we’ve grouped per zipcode and pulled the population.
I realised here that I had not actually merged the coordinates of “zipcode centre” to the master. I got around this by calculating the mean latitude and longitude of stores in the zipcode. Below lies a the map of Iowa where we overlay population bubbles per zipcode and also the location of universities; here the triangle size indicates number of students enrolled.
# Explore demographics ----------------------------------------------------
suppressWarnings(IowaMap +
geom_point(data=gr_zip_pop, aes(x=STORE_LON, y=STORE_LAT, size=POPULATION),
colour="red", alpha=0.5) +
geom_point(data = unis, aes(x=lon, y=lat, size=enrolled), colour="blue",
shape=17, alpha=0.5) +
theme(legend.position="none", axis.title.x=element_blank(),
axis.title.y=element_blank(), axis.ticks=element_blank(),
axis.text.x=element_blank(), axis.text.y=element_blank(),
plot.title = element_text(size = rel(1), colour = "blue")) +
labs(title="Map of Iowa with population and university overlays"))
We can see that Davenport and Iowa city are big University cities. Regarding population, Des Moines, Waterloo, Cedar Rapids, Dubuque and Sioux City are all cities in which certain areas (zipcodes) are highly populated. It will be more useful of course to look at overall city population and that is what we are going to do now.
gr_city_pop <- ddply(data1, .(ZIPCODE, CITY), summarise,
POPULATION = mean(POPULATION, na.rm=T),
STORE_LAT = mean(STORE_LAT, na.rm = T),
STORE_LON = mean(STORE_LON, na.rm = T))
gr_city_pop <- ddply(gr_city_pop, .(CITY), summarise,
TOT_POP = sum(POPULATION, na.rm = T),
LAT = mean(STORE_LAT, na.rm = T),
LON = mean(STORE_LON, na.rm =T))
IowaMap +
geom_point(data=gr_city_pop, aes(x=LON, y=LAT, size=TOT_POP),
colour="purple", alpha=1) +
theme(legend.position="none", axis.title.x=element_blank(),
axis.title.y=element_blank(), axis.ticks=element_blank(),
axis.text.x=element_blank(), axis.text.y=element_blank(),
plot.title = element_text(size = rel(1), colour = "blue")) +
labs(title="Map of Iowa with city population overlay")
And county populations below..
gr_county_pop <- ddply(data1, .(ZIPCODE, COUNTY), summarise,
POPULATION = mean(POPULATION, na.rm=T),
STORE_LAT = mean(STORE_LAT, na.rm = T),
STORE_LON = mean(STORE_LON, na.rm = T))
gr_county_pop <- ddply(gr_county_pop, .(COUNTY), summarise,
TOT_POP = sum(POPULATION, na.rm = T),
LAT = mean(STORE_LAT, na.rm = T),
LON = mean(STORE_LON, na.rm =T))
IowaMap +
geom_point(data=gr_county_pop, aes(x=LON, y=LAT, size=TOT_POP),
colour="blue", alpha=1) +
theme(legend.position="none", axis.title.x=element_blank(),
axis.title.y=element_blank(), axis.ticks=element_blank(),
axis.text.x=element_blank(), axis.text.y=element_blank(),
plot.title = element_text(size = rel(1), colour = "blue")) +
labs(title="Map of Iowa with county population overlay")
I think it’s clear that the centre of Iowa is Des Moines. There are some highly populated areas along the county’s contour, such as Sioux City, Omaha and Dubuque.
Before moving on to the liquor sales analysis I would like to take a look at national group populations.
#My save-space function for plotting
plot_map_nat_over <- function(y){
x <- colnames(gr_zip_pop)[y]
IowaMap +
geom_point(data=gr_zip_pop, aes_string(x="STORE_LON", y="STORE_LAT",
colour=x, alpha=x)) +
scale_colour_gradientn(colours = topo.colors(10), limits=c(0,50)) +
theme(legend.position="none", axis.title.x=element_blank(),
axis.title.y=element_blank(), axis.ticks=element_blank(),
axis.text.x=element_blank(), axis.text.y=element_blank(),
plot.title = element_text(size = rel(1), colour = "blue")) +
labs(title=x)
}
In the composite plot below I am producing something like a heat map for population percentages per zipcode for a selection of nationalities; “cold” is blue and “hot” is yellow/brown (I found it hard to overlay a global legend.)
#Produce maps with overlays for all nationalities
suppressWarnings(multiplot(plot_map_nat_over(3), plot_map_nat_over(6), plot_map_nat_over(7),
plot_map_nat_over(8), plot_map_nat_over(9), plot_map_nat_over(10),
plot_map_nat_over(11), plot_map_nat_over(12), plot_map_nat_over(13),
plot_map_nat_over(14), plot_map_nat_over(15), plot_map_nat_over(16),
plot_map_nat_over(17), plot_map_nat_over(18), plot_map_nat_over(19),
plot_map_nat_over(20), plot_map_nat_over(21),
cols=5))
And finally
#Add establishments variable to zipcode grouped data
test <- unique(cbind(data1$ZIPCODE, data1$TOTAL_EST))
test <- data.table(test)
setnames(test, c(1,2), c("ZIPCODE", "TOTAL_EST"))
test$ZIPCODE <- factor(test$ZIPCODE)
gr_zip_pop <- left_join(gr_zip_pop, test)
rm(test)
#Plot
suppressWarnings(IowaMap +
geom_point(data=gr_zip_pop, aes(x=STORE_LON, y=STORE_LAT, size=TOTAL_EST),
colour="green", alpha=0.5) +
theme(legend.position="none", axis.title.x=element_blank(),
axis.title.y=element_blank(), axis.ticks=element_blank(),
axis.text.x=element_blank(), axis.text.y=element_blank(),
plot.title = element_text(size = rel(1), colour = "blue")) +
labs(title="Map of Iowa with establishment overlays"))
OK now let’s move on to liquor sales.
Let’s audit the liquor sales data a little bit first just to make sure it is trustworthy.
Firstly some data prep:
#Make DF summarising reports for each store each month
gr_st_month_count <- ddply(data1, .(months(DATE), STORE), summarise,
count_reps=length(STORE))
setnames(gr_st_month_count,1, "month")
gr_st_month_count$month <- factor(gr_st_month_count$month,
levels=month.name)
# Make separate DF counting number of participating stores per month
gr_st_month.part_stores <- data.frame("month"=factor(rownames(
table(gr_st_month_count$month)),levels=month.name),
"part_stores"=as.numeric(table(gr_st_month_count$month)))
#Combine data sets for ggplot use
gr_st_month_count <- left_join(gr_st_month_count, gr_st_month.part_stores)
rm(gr_st_month.part_stores)
And on to a boxplot which illustrates the monthly distribution of the number of transactions recorded in the database; I also incorporated information for the number of stores appearing in the data on a monthly basis as a colour variable.
#Number of reports per month and num. of participating stores overlay
ggplot(data=gr_st_month_count, aes(x=month, y=count_reps)) +
stat_boxplot(geom='errorbar') + geom_jitter(alpha=0.2) +
geom_boxplot(aes(fill=part_stores), outlier.size=0) +
coord_cartesian(ylim = c(0, 1000)) +
theme(panel.background = element_rect(fill = 'white'),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.title = element_text(size = rel(1), colour = "blue")) +
labs(x="Month", y = "Transactions",
title = "Monthly Transactions between Iowa State and Liquor Stores")
We see a drastic reduction in the number of reports; a step change occurring in June. The subsequent plots presenting the same data grouped by week.
# Get weekly reports as per month methodology
gr_st_week_count <- ddply(data1, .(format(DATE, "%U"), STORE), summarise,
count_reps=length(STORE))
setnames(gr_st_week_count,1, "week")
gr_st_week_count$week <- as.numeric(gr_st_week_count$week)
gr_st_week.part_stores <- data.frame("week"=rownames(table(gr_st_week_count$week)),
"part_stores"=as.numeric(table(gr_st_week_count$week)))
gr_st_week.part_stores$week <- as.numeric(as.character(gr_st_week.part_stores$week))
gr_st_week_count <- left_join(gr_st_week_count, gr_st_week.part_stores)
#changed my mind. will make week factor
gr_st_week_count$week <- factor(gr_st_week_count$week)
rm(gr_st_week.part_stores)
#Plot: same as plot 1 but using boxplot colour as part. stores variable
ggplot(data=gr_st_week_count, aes(x=week, y=count_reps)) +
stat_boxplot(geom='errorbar') + geom_jitter(alpha=0.2) +
geom_boxplot(aes(fill=part_stores), outlier.size=0) +
coord_cartesian(ylim = c(0, 250)) +
theme(panel.background = element_rect(fill = 'white'),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.title = element_text(size = rel(1), colour = "blue")) +
labs(x="Week", y = "Transactions",
title = "Weekly Transactions between Iowa State and Liquor Stores")
Here we see that the step change occurs on the 1st week of June. Which is suspicious. Let’s zoom in on daily data for that period.
gr_st_day_count <- ddply(filter(data1, (DATE <= "2014-06-08") &
(DATE >="2014-05-25")), .(DATE, STORE),
summarise, count_reps=length(STORE))
setnames(gr_st_day_count,1, "day")
gr_st_day.part_stores <- data.frame("day"=rownames(table(gr_st_day_count$day)),
"part_stores"=as.numeric(table(gr_st_day_count$day)))
gr_st_day.part_stores$day <- as.Date(as.character(gr_st_day.part_stores$day))
gr_st_day_count <- left_join(gr_st_day_count, gr_st_day.part_stores)
rm(gr_st_day.part_stores)
ggplot(data=gr_st_day_count, aes(x=as.character(day), y=count_reps)) +
stat_boxplot(geom='errorbar') + geom_jitter(alpha=0.2) +
geom_boxplot(aes(fill=part_stores), outlier.size=0) +
coord_cartesian(ylim = c(0, 250)) +
theme(panel.background = element_rect(fill = 'white'),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.title = element_text(size = rel(1), colour = "blue")) +
labs(x="Day", y = "Transactions",
title = "Daily Transactions between Iowa State and Liquor Stores")
So it’s actually the first day of June that the number of reports reduces dramatically. I performed a quick Google search on whether anything changed drastically in terms of Iowa liquor sales legislation, or if anything major happened but to no avail. Furthermore, I believe this is not a seasonal effect (end of Univ. semesters, basketball/football end of season) because the number of reports:
Therefore I feel that this pattern has something to do with data collection/recording.
One thing I observed was that the fiscal year for the relevant Iowa state division runs from July to June. So this dataset partially contains data from two fiscal years (January to June: 2013, July to December: 2014). Could this mean something?
Anyway, I decided not to pursue this investigation any further; it is clear that the dataset has limitations and it is probably best to analyse these two periods independently (at least for time series analysis).
Before I move on to the main body of the analysis, let’s look at the distribution of transactions per weekday.
#Weekday
gr_st_weekday_count <- ddply(data1, .(weekdays(DATE), STORE), summarise,
count_reps=length(STORE))
setnames(gr_st_weekday_count,1, "weekday")
day.name <- c("Monday", "Tuesday", "Wednesday", "Thursday",
"Friday", "Saturday", "Sunday")
gr_st_weekday_count$weekday <- factor(gr_st_weekday_count$weekday, levels=day.name)
gr_st_weekday.part_stores <- data.frame("weekday"=rownames(table(gr_st_weekday_count$weekday)),
"part_stores"=as.numeric(table(gr_st_weekday_count$weekday)))
gr_st_weekday.part_stores$weekday <- factor(gr_st_weekday.part_stores$weekday, levels=day.name)
gr_st_weekday_count <- left_join(gr_st_weekday_count, gr_st_weekday.part_stores)
rm(gr_st_weekday.part_stores)
#Plot data
ggplot(data=gr_st_weekday_count, aes(x=weekday, y=count_reps)) +
stat_boxplot(geom='errorbar') + geom_jitter(alpha=0.2) +
geom_boxplot(aes(fill=part_stores), outlier.size=0) +
coord_cartesian(ylim = c(0, 4500)) +
theme(panel.background = element_rect(fill = 'white'),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.title = element_text(size = rel(1), colour = "blue")) +
labs(x="Weekday", y = "Transactions",
title =
"Number of orders submitted per weekday")
Here we see that most orders are placed on Wednesdays and Thursdays.
We saw that time series analyses on the full data-set are probably not suitable.
Let’s look at various dataset properties.
Plotting the distribution of bottles sold per transaction.
ggplot(data=data1, aes(x=`BOTTLE QTY`)) +
geom_histogram(aes(y =..density..),
fill=I("blue"), col=I("black"), alpha=0.5, origin = -0.5) +
labs(title="Histogram for Bottle Quantities", x="Bottles", y="Count") +
theme(panel.background = element_rect(fill = 'white'),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.ticks=element_blank(),
plot.title = element_text(size = rel(1), colour = "blue"))
Not easy to see shape of distribution. Looking at log-transform.
bks <- seq(min(data1$`BOTTLE QTY`, na.rm=T), max(data1$`BOTTLE QTY`, na.rm=T))
ggplot(data=data1, aes(x=`BOTTLE QTY`)) +
geom_histogram(aes(y =..density..),
fill=I("blue"), col=I("black"), alpha=0.5, origin = -0.5) +
labs(title="Histogram for Log Bottle Quantities", x="Bottles", y="Count") +
scale_x_log10(breaks=bks, labels=NULL) +
theme(panel.background = element_rect(fill = 'white'),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.ticks=element_blank(),
plot.title = element_text(size = rel(1), colour = "blue"))
The log transform shows that the amount of bottles sold per transaction is to a large extent “quantised” (or non-uniform?). Let’s focus on the denser area of the distribution with un-transformed axes.
ggplot(data=data1, aes(x=`BOTTLE QTY`)) +
geom_histogram(aes(y =..density..),
fill=I("blue"), col=I("black"), alpha=0.5, origin = -0.5) +
labs(title="Histogram for Bottle Quantities", x="Bottles", y="Count") +
theme(panel.background = element_rect(fill = 'white'),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.ticks=element_blank(),
plot.title = element_text(size = rel(1), colour = "blue")) +
scale_x_continuous(limits=c(0,50), breaks=seq(0,50,2))
So here we see that the mode of the distribution is 12; most orders are submitted for a dozen bottles of some liquor. We also see that multiples of 12 are also high occurring, as well as 6 (half-a-dozen).
Now I would like to look at total bottle sales as time-series, so let’s create appropriate variables for the period of January to May.
gr_date.Jan_May <- ddply(filter(data1, (DATE < "2014-06-01")),
.(DATE), summarise, BOTTLES = sum(`BOTTLE QTY`, na.rm=T),
BOTTLES_NORM = sum(`BOTTLE QTY`/POPULATION, na.rm=T))
The above data frame contains three variables: date, total bottles sold and normalised bottles sold. The latter variable is an attempt to adjust for the population bias. This effect arises because we aggregate data from variably populated areas in Iowa and can potentially obscure meaningful relationships in the analysis of the data. The following code rearranges the data frame a bit to help out ggplot2.
test <- gr_date.Jan_May
maxbot <- max(gr_date.Jan_May$BOTTLES, na.rm=T)
maxbotnorm <- max(gr_date.Jan_May$BOTTLES_NORM, na.rm=T)
test<- arrange(gather(test, "type", "bottles", 2:3), DATE)
test <- mutate(test, maxbottles =ifelse(type=="BOTTLES", maxbot, maxbotnorm))
gr_date.Jan_May <- test
rm(test)
And the plot follows..
ggplot(data=gr_date.Jan_May,aes(x=DATE, y=bottles/maxbottles, colour=type)) +
geom_point()+ geom_line() +
geom_ribbon(aes(ymin=0, ymax=bottles/maxbottles, fill=type)) +
scale_fill_brewer(palette="Set3")+
labs(x="Date", y = "Bottles")+
theme(panel.background = element_rect(fill = 'white'),
plot.title = element_text(size = rel(1), colour = "blue"))
When we look at the raw time series of the quantity of bottles ordered, we see some type of seasonality but it is not easy to interpret. However, once we attenuate the population effect, we see a very stable weekly pattern spiking on a single weekday. Let’s look at this data in more detail.
p1 <- ggplot(data=filter(gr_date.Jan_May, type=="BOTTLES_NORM"),
aes(x=factor(weekdays(DATE), levels=day.names),
y=bottles)) +geom_boxplot()+
labs(x="Weekday", y="Bottles (pop bias removed)")
p2 <- ggplot(data=filter(gr_date.Jan_May, type=="BOTTLES_NORM"),
aes(x=format(DATE, "%U"),
y=bottles)) +geom_boxplot()+
labs(x="Week", y="Bottles (pop bias removed)")
p3 <- ggplot(data=filter(gr_date.Jan_May, type=="BOTTLES_NORM"),
aes(x=factor(months(DATE), levels=month.name),
y=bottles)) +geom_boxplot()+
labs(x="Month", y="Bottles (pop bias removed)")
multiplot(p1, p2, p3, cols=2)
rm(p1,p2,p3)
So in the weekday plot we see more clearly this stable pattern in quantities of bottles ordered per citizen. Tuesday is the “spike” day, Monday and Wednesday are pretty much the same; after Wednesday the quantities decline. In my mind this pattern makes sense: Monday is the first day of the week and the stores calculate their weekly requirements by the end of the day. They then submit the order either at end of business or early Tuesday; this allows time for the bottles to be delivered prior to the weekend. They shouldn’t need to order too many bottles after the early week’s budgeting.
It is a bit counter-intuitive though that in a previous plot we saw that the number of orders per weekday (as opposed to quantity of bottles ordered per weekday) was greatest on Wednesday’s and Thursday’s. Maybe if I had corrected for the population bias, a similar trend would have been revealed…
One last note on the previous weekday analysis. Let’s plot the raw and population-adjusted quantities against each other and see what we get. I hope it will show a bit more on how the normalisation helped the analysis.
ggplot(data=gr_date.Jan_May,aes(x=bottles[type=="BOTTLES"]/maxbottles[type=="BOTTLES"],
y=bottles[type=="BOTTLES_NORM"]/maxbottles[type=="BOTTLES_NORM"],
shape=weekdays(DATE[type=="BOTTLES"]),
colour=weekdays(DATE[type=="BOTTLES"]))) +
geom_point(size=3) + labs(x="raw quantities", y="population adjusted quantities")
I like this plot because it looks like a machine learning exercise.. We see here that each day forms an (almost) discrete cluster, with Tuesdays and Fridays having maximum separation from the main body of observations; however Fridays would belong to a high R-squared fitter regression line of all observations excluding Tuesdays. So I think the real outlier here is Tuesdays, in the sense that the normalisation added high information value for that weekday.
In this small section I will group data per item (liquor) and plot the relationship between number of bottles ordered and bottle price.
We’ve already generated the relevant data frame gr_item.
Here is the raw plot…
#plot raw
ggplot(data=gr_item, aes(x=price, y=bottles)) + geom_point()
We don’t see too much. In these cases it is often helpful to log-transform axes.
#plot log-transformed
#good relationship
ggplot(data=gr_item, aes(x=price, y=bottles)) + geom_point(alpha=0.3) +
scale_y_log10() + scale_x_log10()+ geom_smooth(fill="red")
Let’s focus on the main body of observations. The function that adds the annotation to the following plot was taken from stackoverflow..
#Define lm_eqn function that adds model annotation
lm_eqn = function(m) {
l <- list(a = format(coef(m)[1], digits = 2),
b = format(abs(coef(m)[2]), digits = 2),
r2 = format(summary(m)$r.squared, digits = 3));
if (coef(m)[2] >= 0) {
eq <- substitute(italic(y) == a + b %.% italic(x)*","~~italic(r)^2~"="~r2,l)
} else {
eq <- substitute(italic(y) == a - b %.% italic(x)*","~~italic(r)^2~"="~r2,l)
}
as.character(as.expression(eq));
}
#Plot
ggplot(data=gr_item, aes(x=price, y=bottles)) + geom_point(alpha=0.3) +
scale_y_log10() + scale_x_log10(limits=c(5,250))+ geom_smooth(method="lm",fill="red") +
geom_text(x = log10(150), y = log10(1e+05),
label = lm_eqn(lm(log10(bottles)~log10(price),gr_item)),
parse = TRUE)
There is a a weak association (R-squared 0.13) between bottle price and quantities ordered. Therefore, from the plots above one cannot detect a simple function that links price and demand, however it is obvious that the relationship is inverse.
In this small section I would like to look at order quantities as time series (note: using full dataset for this part). Let’s calculate our data frame first, where we group by month and liquor category, calculate total bottles ordered per category/month and derive from this a relevant normalised variable representing percentages over total bottles ordered.
gr_date_cat2 <- ddply(data1, .(DATE, `CATEGORY NAME`), summarise,
BOTTLES = sum(`BOTTLE QTY`, na.rm=T))
gr_date_cat2.month <- ddply(gr_date_cat2, .(factor(months(DATE), levels=month.name),
`CATEGORY NAME`), summarise,
BOTTLES = sum(`BOTTLES`, na.rm=T))
setnames(gr_date_cat2.month, 1, "Month")
gr_date_cat3 <- ddply(gr_date_cat2.month, .(Month), summarise,
TOT_BOTT = sum(BOTTLES, na.rm=T))
gr_date_cat2.month <- left_join(gr_date_cat2.month, gr_date_cat3)
gr_date_cat2.month$BCAT_NORM <- gr_date_cat2.month$BOTTLES/gr_date_cat2.month$TOT_BOTT
setnames(gr_date_cat2.month, 2, "CATEGORY")
# Some necessary manipulation for ggplot2 intricacies
dec_dates <- decimal_date(as.Date(c("2014-01-01", "2014-02-01", "2014-03-01",
"2014-04-01", "2014-05-01", "2014-06-01",
"2014-07-01", "2014-08-01", "2014-09-01",
"2014-10-01", "2014-11-01", "2014-12-01")))
gr_date_cat2.month <- mutate(
gr_date_cat2.month, month_nom = ifelse(Month=="January", dec_dates[1], ifelse(Month=="February", dec_dates[2], ifelse(Month=="March", dec_dates[3], ifelse(Month=="April", dec_dates[4], ifelse(Month=="May", dec_dates[5], ifelse(Month=="June", dec_dates[6], ifelse(Month=="July", dec_dates[7], ifelse(Month=="August", dec_dates[8], ifelse(Month=="September", dec_dates[9], ifelse(Month=="October", dec_dates[10], ifelse(Month=="November", dec_dates[11], ifelse(Month=="December", dec_dates[12], 0)))))))))))))
gr_date_cat2.month[gr_date_cat2.month$CATEGORY=="",]$CATEGORY <- "UNKNOWN"
In the subsequent plots I show time series of normalised bottle quantities of orders for each liquor category per month; I have added a colour overlay of season (winter, spring,…) in order to discern seasonal patterns more easily.
# Settings for ggplot2
txmin <- decimal_date(as.Date(c("2014-01-01", "2014-03-01", "2014-06-01","2014-09-01", "2014-12-01")))
txmax <- c(decimal_date(as.Date(c("2014-03-01", "2014-06-01", "2014-09-01","2014-12-01"))), Inf)
tymin <- rep(-Inf, 5)
tymax <- rep(Inf, 5)
season= c("1.winter", "2.spring","3.summer","4.autumn","1.winter")
rects <- data.frame(txmin,txmax,tymin,tymax,season)
rects[,5] <- sapply(rects[,5], as.character)
colScale <- c(col2hcl("blue"), col2hcl("green"), col2hcl("red"), col2hcl("purple"), col2hcl("blue"))
#Plotting
ggplot(data=gr_date_cat2.month) +
geom_rect(data=rects, aes(xmin=txmin,xmax=txmax,ymin=tymin,ymax=tymax,
fill=season), alpha=0.4)+
scale_fill_manual(values=colScale)+
geom_line(aes(x=month_nom, y=BCAT_NORM, group=1))+
facet_wrap(~CATEGORY, scales="free_y")+
labs(x="Month", y = "Normalised bottle sales",
title = "Liquor categories bottle sales with season overlay") +
theme(panel.background = element_rect(fill = 'white'),
strip.text = element_text(size=7),
strip.background = element_rect(fill="white"),
axis.text.y = element_blank(),
axis.text.x = element_blank(),
axis.ticks=element_blank(),
plot.title = element_text(size = rel(1), colour = "blue"))
A few (qualitative) observations can be made. We can identify that certain liquor types have a demand curve that only peaks in December, for example liqueurs with various flavours like crème de menthe/cacao/etc., brandies and some types of schnapps (cinnamon, butterscotch). Given that these types of liquors have a heavier taste and are generally considered “warming”, their demand curve makes sense - we identified the “winter drinks”.
Another observation that makes sense, is the very similar curve shape of tequila and triple sec; two of the margarita’s ingredients (on the side I plotted respective bottle quantities to see if the 2:1 margarita ratio is reflected in bottle quantities - it didn’t). This demand pattern is there for a few of the liquors (flavoured rums, imported vodka), where bottle quantities are clearly correlated with expected average temperatures (demand lowest in winter, climbs gradually in spring, peaks between mid-spring/mid-summer, gradually descends in autumn). As opposed to the previous category, these are the “summer drinks”.
There is a lot of work that can still be performed here with good potential for hidden insights. Just by looking at the time series we can identify interesting patterns which don’t make immediate sense. For example:
As a final insight in this section, let’s look at time series correlations between categories.
In order to pretty-print the correlation table, I will use the great corstarsl function taken from here.
#Took this function that makes nice correlation tables from here
#http://myowelt.blogspot.co.uk/2008/04/beautiful-correlation-tables-in-r.html
corstarsl <- function(x){
require(Hmisc)
x <- as.matrix(x)
R <- rcorr(x)$r
p <- rcorr(x)$P
## define notions for significance levels; spacing is important.
mystars <- ifelse(p < .001, "***", ifelse(p < .01, "** ", ifelse(p < .05, "* ", " ")))
## trunctuate the matrix that holds the correlations to two decimal
R <- format(round(cbind(rep(-1.11, ncol(x)), R), 2))[,-1]
## build a new matrix that includes the correlations with their apropriate stars
Rnew <- matrix(paste(R, mystars, sep=""), ncol=ncol(x))
diag(Rnew) <- paste(diag(R), " ", sep="")
rownames(Rnew) <- colnames(x)
colnames(Rnew) <- paste(colnames(x), "", sep="")
## remove upper triangle
Rnew <- as.matrix(Rnew)
Rnew[upper.tri(Rnew, diag = TRUE)] <- ""
Rnew <- as.data.frame(Rnew)
## remove last column and return the matrix (which is now a data frame)
Rnew <- cbind(Rnew[1:length(Rnew)-1])
return(Rnew)
}
And the table:
print.xtable(xtable(corstarsl(gr_date_cat.spread[,c(3:70)])), comment=F, type="html", size=1)
| 100 PROOF VODKA | 80 PROOF VODKA | AMERICAN ALCOHOL | AMERICAN AMARETTO | AMERICAN COCKTAILS | AMERICAN DRY GINS | AMERICAN GRAPE BRANDIES | AMERICAN SLOE GINS | ANISETTE | APPLE SCHNAPPS | APRICOT BRANDIES | BARBADOS RUM | BLACKBERRY BRANDIES | BLENDED WHISKIES | BOTTLED IN BOND BOURBON | BUTTERSCOTCH SCHNAPPS | CANADIAN WHISKIES | CHERRY BRANDIES | CINNAMON SCHNAPPS | COFFEE LIQUEURS | CREAM LIQUEURS | CREME DE ALMOND | DARK CREME DE CACAO | DECANTERS & SPECIALTY PACKAGES | DISTILLED SPIRITS SPECIALTY | FLAVORED GIN | FLAVORED RUM | FLAVORED VODKA | GRAPE SCHNAPPS | GREEN CREME DE MENTHE | HIGH PROOF BEER | IMPORTED AMARETTO | IMPORTED DRY GINS | IMPORTED GRAPE BRANDIES | IMPORTED SCHNAPPS | IMPORTED VODKA | IMPORTED VODKA - MISC | IRISH WHISKIES | JAMAICA RUM | LOW PROOF VODKA | MISC. AMERICAN CORDIALS & LIQUEURS | MISC. IMPORTED CORDIALS & LIQUEURS | MISCELLANEOUS BRANDIES | MISCELLANEOUS SCHNAPPS | OTHER PROOF VODKA | PEACH BRANDIES | PEACH SCHNAPPS | PEPPERMINT SCHNAPPS | PUERTO RICO & VIRGIN ISLANDS RUM | RASPBERRY SCHNAPPS | ROCK & RYE | ROOT BEER SCHNAPPS | SCOTCH WHISKIES | SINGLE BARREL BOURBON WHISKIES | SINGLE MALT SCOTCH | SPEARMINT SCHNAPPS | SPICED RUM | STRAIGHT BOURBON WHISKIES | STRAIGHT RYE WHISKIES | STRAWBERRY SCHNAPPS | TENNESSEE WHISKIES | TEQUILA | TRIPLE SEC | TROPICAL FRUIT SCHNAPPS | WATERMELON SCHNAPPS | WHISKEY LIQUEUR | WHITE CREME DE CACAO | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100 PROOF VODKA | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 80 PROOF VODKA | 0.65*** | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| AMERICAN ALCOHOL | 0.38*** | 0.39*** | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| AMERICAN AMARETTO | 0.65*** | 0.83*** | 0.53*** | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| AMERICAN COCKTAILS | 0.49*** | 0.76*** | 0.20** | 0.65*** | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| AMERICAN DRY GINS | 0.46*** | 0.57*** | 0.20** | 0.40*** | 0.37*** | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| AMERICAN GRAPE BRANDIES | 0.40*** | 0.57*** | 0.14 | 0.47*** | 0.34*** | 0.78*** | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| AMERICAN SLOE GINS | 0.33*** | 0.50*** | 0.26*** | 0.50*** | 0.50*** | 0.19** | 0.25*** | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ANISETTE | 0.35*** | 0.44*** | 0.21** | 0.38*** | 0.29*** | 0.31*** | 0.27*** | 0.16* | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| APPLE SCHNAPPS | 0.54*** | 0.73*** | 0.38*** | 0.71*** | 0.58*** | 0.47*** | 0.55*** | 0.40*** | 0.36*** | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| APRICOT BRANDIES | 0.42*** | 0.59*** | 0.43*** | 0.65*** | 0.47*** | 0.14* | 0.14* | 0.49*** | 0.19* | 0.39*** | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| BARBADOS RUM | 0.01 | 0.06 | -0.05 | 0.00 | 0.01 | 0.01 | 0.04 | -0.03 | 0.02 | 0.00 | -0.05 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| BLACKBERRY BRANDIES | 0.44*** | 0.61*** | 0.29*** | 0.64*** | 0.48*** | 0.13 | 0.14 | 0.48*** | 0.23** | 0.39*** | 0.81*** | -0.06 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
| BLENDED WHISKIES | 0.64*** | 0.89*** | 0.46*** | 0.85*** | 0.68*** | 0.54*** | 0.59*** | 0.57*** | 0.43*** | 0.81*** | 0.54*** | 0.04 | 0.56*** | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
| BOTTLED IN BOND BOURBON | 0.26*** | 0.24*** | 0.16* | 0.28*** | 0.14 | 0.36*** | 0.39*** | 0.01 | 0.21* | 0.20** | 0.15* | 0.01 | 0.11 | 0.30*** | |||||||||||||||||||||||||||||||||||||||||||||||||||||
| BUTTERSCOTCH SCHNAPPS | 0.47*** | 0.67*** | 0.49*** | 0.67*** | 0.41*** | 0.30*** | 0.33*** | 0.33*** | 0.35*** | 0.71*** | 0.52*** | -0.04 | 0.51*** | 0.68*** | 0.08 | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| CANADIAN WHISKIES | 0.51*** | 0.79*** | 0.41*** | 0.71*** | 0.66*** | 0.33*** | 0.35*** | 0.50*** | 0.29*** | 0.51*** | 0.72*** | 0.03 | 0.70*** | 0.69*** | 0.19** | 0.53*** | |||||||||||||||||||||||||||||||||||||||||||||||||||
| CHERRY BRANDIES | 0.33*** | 0.54*** | 0.27*** | 0.50*** | 0.45*** | 0.12 | 0.14 | 0.45*** | 0.18* | 0.38*** | 0.58*** | -0.02 | 0.60*** | 0.49*** | 0.14 | 0.48*** | 0.62*** | ||||||||||||||||||||||||||||||||||||||||||||||||||
| CINNAMON SCHNAPPS | 0.54*** | 0.71*** | 0.52*** | 0.69*** | 0.45*** | 0.24*** | 0.28*** | 0.43*** | 0.34*** | 0.61*** | 0.64*** | -0.03 | 0.60*** | 0.66*** | 0.09 | 0.82*** | 0.61*** | 0.54*** | |||||||||||||||||||||||||||||||||||||||||||||||||
| COFFEE LIQUEURS | 0.58*** | 0.74*** | 0.58*** | 0.75*** | 0.49*** | 0.36*** | 0.42*** | 0.48*** | 0.29*** | 0.62*** | 0.61*** | 0.02 | 0.60*** | 0.73*** | 0.21** | 0.68*** | 0.70*** | 0.50*** | 0.71*** | ||||||||||||||||||||||||||||||||||||||||||||||||
| CREAM LIQUEURS | 0.41*** | 0.60*** | 0.53*** | 0.64*** | 0.32*** | 0.24*** | 0.33*** | 0.33*** | 0.30*** | 0.49*** | 0.53*** | 0.04 | 0.53*** | 0.63*** | 0.19** | 0.64*** | 0.59*** | 0.40*** | 0.62*** | 0.72*** | |||||||||||||||||||||||||||||||||||||||||||||||
| CREME DE ALMOND | 0.34** | 0.08 | 0.21 | 0.39*** | -0.02 | -0.05 | 0.00 | 0.06 | -0.02 | 0.14 | 0.26* | -0.07 | 0.45*** | 0.14 | -0.18 | 0.15 | 0.08 | -0.01 | 0.30* | 0.34** | 0.31* | ||||||||||||||||||||||||||||||||||||||||||||||
| DARK CREME DE CACAO | 0.28*** | 0.55*** | 0.37*** | 0.59*** | 0.39*** | 0.21** | 0.28*** | 0.37*** | 0.28*** | 0.55*** | 0.45*** | -0.06 | 0.40*** | 0.55*** | 0.08 | 0.55*** | 0.51*** | 0.41*** | 0.55*** | 0.59*** | 0.53*** | 0.25* | |||||||||||||||||||||||||||||||||||||||||||||
| DECANTERS & SPECIALTY PACKAGES | -0.10 | -0.16* | 0.08 | -0.12 | -0.20** | -0.18* | -0.17* | -0.13 | 0.07 | -0.16* | -0.14 | -0.04 | -0.15* | -0.10 | -0.07 | -0.06 | -0.07 | -0.11 | -0.09 | -0.07 | 0.11 | -0.19 | -0.08 | ||||||||||||||||||||||||||||||||||||||||||||
| DISTILLED SPIRITS SPECIALTY | 0.41*** | 0.59*** | 0.24*** | 0.48*** | 0.43*** | 0.44*** | 0.40*** | 0.24*** | 0.29*** | 0.49*** | 0.25*** | -0.06 | 0.27*** | 0.54*** | 0.14 | 0.37*** | 0.47*** | 0.29*** | 0.38*** | 0.50*** | 0.31*** | 0.01 | 0.34*** | -0.28*** | |||||||||||||||||||||||||||||||||||||||||||
| FLAVORED GIN | 0.54*** | 0.68*** | 0.31*** | 0.65*** | 0.51*** | 0.50*** | 0.57*** | 0.41*** | 0.32*** | 0.60*** | 0.38*** | 0.04 | 0.36*** | 0.72*** | 0.23** | 0.48*** | 0.50*** | 0.28*** | 0.49*** | 0.57*** | 0.44*** | 0.11 | 0.32*** | -0.10 | 0.42*** | ||||||||||||||||||||||||||||||||||||||||||
| FLAVORED RUM | 0.53*** | 0.81*** | 0.24*** | 0.76*** | 0.81*** | 0.52*** | 0.57*** | 0.46*** | 0.32*** | 0.66*** | 0.45*** | 0.06 | 0.45*** | 0.79*** | 0.24** | 0.47*** | 0.68*** | 0.43*** | 0.50*** | 0.57*** | 0.42*** | 0.01 | 0.46*** | -0.08 | 0.50*** | 0.63*** | |||||||||||||||||||||||||||||||||||||||||
| FLAVORED VODKA | 0.52*** | 0.76*** | 0.25*** | 0.64*** | 0.63*** | 0.61*** | 0.70*** | 0.31*** | 0.30*** | 0.62*** | 0.24*** | 0.09 | 0.30*** | 0.70*** | 0.24** | 0.44*** | 0.48*** | 0.21** | 0.43*** | 0.55*** | 0.41*** | 0.10 | 0.36*** | -0.11 | 0.43*** | 0.64*** | 0.72*** | ||||||||||||||||||||||||||||||||||||||||
| GRAPE SCHNAPPS | 0.35*** | 0.47*** | 0.04 | 0.38*** | 0.34*** | 0.60*** | 0.72*** | 0.16* | 0.23** | 0.62*** | -0.05 | 0.01 | -0.03 | 0.51*** | 0.23** | 0.35*** | 0.13 | 0.03 | 0.23** | 0.33*** | 0.23** | -0.03 | 0.28*** | -0.18* | 0.44*** | 0.44*** | 0.44*** | 0.58*** | |||||||||||||||||||||||||||||||||||||||
| GREEN CREME DE MENTHE | 0.41*** | 0.42*** | 0.29*** | 0.58*** | 0.26*** | 0.22** | 0.28*** | 0.34*** | 0.12 | 0.43*** | 0.46*** | -0.05 | 0.62*** | 0.51*** | 0.15 | 0.42*** | 0.33*** | 0.22** | 0.47*** | 0.45*** | 0.45*** | 0.74*** | 0.39*** | -0.17* | 0.16* | 0.33*** | 0.32*** | 0.30*** | 0.24*** | ||||||||||||||||||||||||||||||||||||||
| HIGH PROOF BEER | -0.29 | -0.65 | NaNNA | -0.67 | 0.18 | 0.12 | 0.00 | -0.55 | -0.50 | -0.35 | -0.75 | 0.37 | -0.63 | -0.52 | 0.31 | -0.99** | -0.40 | -0.83 | -0.97* | -0.94 | -0.99** | NaNNA | -0.95* | -0.15 | 0.38 | -0.23 | -0.39 | -0.20 | 0.30 | NaNNA | |||||||||||||||||||||||||||||||||||||
| IMPORTED AMARETTO | 0.45*** | 0.62*** | 0.43*** | 0.64*** | 0.43*** | 0.24*** | 0.34*** | 0.35*** | 0.16* | 0.59*** | 0.43*** | 0.05 | 0.46*** | 0.67*** | 0.16* | 0.58*** | 0.51*** | 0.37*** | 0.56*** | 0.62*** | 0.55*** | 0.27* | 0.45*** | -0.14 | 0.33*** | 0.46*** | 0.52*** | 0.45*** | 0.30*** | 0.44*** | -0.81 | ||||||||||||||||||||||||||||||||||||
| IMPORTED DRY GINS | 0.39*** | 0.69*** | 0.17* | 0.56*** | 0.63*** | 0.42*** | 0.49*** | 0.34*** | 0.33*** | 0.59*** | 0.19** | 0.06 | 0.18* | 0.71*** | 0.12 | 0.44*** | 0.48*** | 0.22** | 0.41*** | 0.44*** | 0.40*** | -0.14 | 0.41*** | -0.05 | 0.45*** | 0.55*** | 0.74*** | 0.62*** | 0.54*** | 0.16* | -0.85 | 0.45*** | |||||||||||||||||||||||||||||||||||
| IMPORTED GRAPE BRANDIES | 0.38*** | 0.51*** | 0.12 | 0.40*** | 0.38*** | 0.70*** | 0.91*** | 0.30*** | 0.16 | 0.48*** | 0.12 | 0.04 | 0.13 | 0.54*** | 0.37*** | 0.23*** | 0.35*** | 0.11 | 0.19** | 0.37*** | 0.26*** | -0.09 | 0.18* | -0.13 | 0.35*** | 0.57*** | 0.56*** | 0.65*** | 0.65*** | 0.21** | 0.26 | 0.28*** | 0.50*** | ||||||||||||||||||||||||||||||||||
| IMPORTED SCHNAPPS | 0.59*** | 0.86*** | 0.42*** | 0.83*** | 0.65*** | 0.45*** | 0.51*** | 0.51*** | 0.38*** | 0.79*** | 0.51*** | 0.05 | 0.53*** | 0.92*** | 0.17* | 0.70*** | 0.65*** | 0.46*** | 0.67*** | 0.74*** | 0.63*** | 0.06 | 0.56*** | -0.14 | 0.55*** | 0.69*** | 0.74*** | 0.68*** | 0.52*** | 0.47*** | -0.80 | 0.69*** | 0.70*** | 0.48*** | |||||||||||||||||||||||||||||||||
| IMPORTED VODKA | 0.52*** | 0.82*** | 0.32*** | 0.69*** | 0.63*** | 0.56*** | 0.60*** | 0.40*** | 0.43*** | 0.67*** | 0.29*** | 0.07 | 0.33*** | 0.82*** | 0.22** | 0.49*** | 0.63*** | 0.34*** | 0.48*** | 0.61*** | 0.48*** | -0.06 | 0.48*** | -0.01 | 0.58*** | 0.65*** | 0.81*** | 0.72*** | 0.57*** | 0.30*** | -0.64 | 0.54*** | 0.81*** | 0.59*** | 0.81*** | ||||||||||||||||||||||||||||||||
| IMPORTED VODKA - MISC | 0.43*** | 0.66*** | 0.14 | 0.56*** | 0.56*** | 0.58*** | 0.71*** | 0.23*** | 0.34*** | 0.62*** | 0.07 | 0.05 | 0.12 | 0.68*** | 0.18* | 0.37*** | 0.39*** | 0.15* | 0.30*** | 0.44*** | 0.32*** | -0.05 | 0.35*** | -0.02 | 0.51*** | 0.61*** | 0.69*** | 0.74*** | 0.71*** | 0.22** | -0.04 | 0.39*** | 0.77*** | 0.72*** | 0.69*** | 0.81*** | |||||||||||||||||||||||||||||||
| IRISH WHISKIES | 0.36*** | 0.50*** | 0.28*** | 0.45*** | 0.23*** | 0.58*** | 0.48*** | 0.21** | 0.32*** | 0.45*** | 0.25*** | -0.01 | 0.33*** | 0.57*** | 0.20** | 0.38*** | 0.33*** | 0.23** | 0.31*** | 0.49*** | 0.48*** | 0.06 | 0.34*** | -0.09 | 0.44*** | 0.45*** | 0.42*** | 0.40*** | 0.45*** | 0.41*** | -0.63 | 0.38*** | 0.37*** | 0.42*** | 0.57*** | 0.53*** | 0.42*** | ||||||||||||||||||||||||||||||
| JAMAICA RUM | 0.47*** | 0.60*** | 0.30*** | 0.59*** | 0.42*** | 0.43*** | 0.50*** | 0.31*** | 0.35*** | 0.55*** | 0.32*** | 0.10 | 0.29*** | 0.64*** | 0.29*** | 0.50*** | 0.40*** | 0.24*** | 0.42*** | 0.55*** | 0.38*** | 0.00 | 0.33*** | -0.06 | 0.41*** | 0.57*** | 0.57*** | 0.52*** | 0.48*** | 0.27*** | -0.84 | 0.51*** | 0.56*** | 0.49*** | 0.65*** | 0.60*** | 0.55*** | 0.36*** | |||||||||||||||||||||||||||||
| LOW PROOF VODKA | 0.35*** | 0.49*** | 0.10 | 0.50*** | 0.45*** | 0.21** | 0.17* | 0.36*** | 0.24** | 0.28*** | 0.51*** | 0.01 | 0.57*** | 0.45*** | 0.25*** | 0.28*** | 0.56*** | 0.42*** | 0.34*** | 0.39*** | 0.29*** | 0.16 | 0.34*** | -0.10 | 0.15* | 0.35*** | 0.48*** | 0.26*** | 0.08 | 0.34*** | -0.29 | 0.38*** | 0.31*** | 0.20** | 0.40*** | 0.35*** | 0.19** | 0.28*** | 0.35*** | ||||||||||||||||||||||||||||
| MISC. AMERICAN CORDIALS & LIQUEURS | 0.39*** | 0.65*** | 0.39*** | 0.47*** | 0.37*** | 0.47*** | 0.37*** | 0.35*** | 0.31*** | 0.51*** | 0.39*** | 0.01 | 0.40*** | 0.59*** | 0.24** | 0.52*** | 0.57*** | 0.48*** | 0.55*** | 0.54*** | 0.51*** | -0.04 | 0.38*** | 0.06 | 0.41*** | 0.39*** | 0.44*** | 0.42*** | 0.30*** | 0.20** | -0.91 | 0.40*** | 0.33*** | 0.32*** | 0.56*** | 0.51*** | 0.34*** | 0.41*** | 0.35*** | 0.27*** | |||||||||||||||||||||||||||
| MISC. IMPORTED CORDIALS & LIQUEURS | 0.53*** | 0.82*** | 0.30*** | 0.73*** | 0.63*** | 0.52*** | 0.60*** | 0.36*** | 0.47*** | 0.76*** | 0.35*** | 0.05 | 0.41*** | 0.84*** | 0.22** | 0.61*** | 0.65*** | 0.40*** | 0.55*** | 0.66*** | 0.56*** | -0.03 | 0.59*** | -0.11 | 0.60*** | 0.62*** | 0.75*** | 0.68*** | 0.62*** | 0.36*** | -0.89 | 0.58*** | 0.80*** | 0.54*** | 0.85*** | 0.87*** | 0.78*** | 0.54*** | 0.65*** | 0.40*** | 0.49*** | ||||||||||||||||||||||||||
| MISCELLANEOUS BRANDIES | 0.10 | 0.17* | 0.15 | 0.08 | 0.05 | 0.12 | 0.11 | 0.14 | 0.06 | 0.12 | 0.14 | 0.09 | 0.20** | 0.16* | -0.01 | 0.24** | 0.09 | 0.13 | 0.21** | 0.23** | 0.15* | -0.05 | 0.11 | 0.19* | 0.15* | 0.02 | 0.09 | 0.06 | 0.13 | 0.15* | 0.21 | 0.14 | -0.02 | 0.11 | 0.15* | 0.06 | 0.05 | 0.14 | 0.13 | 0.03 | 0.34*** | 0.09 | |||||||||||||||||||||||||
| MISCELLANEOUS SCHNAPPS | 0.42*** | 0.60*** | 0.29*** | 0.54*** | 0.38*** | 0.31*** | 0.38*** | 0.32*** | 0.25** | 0.55*** | 0.33*** | 0.05 | 0.34*** | 0.67*** | 0.06 | 0.52*** | 0.42*** | 0.29*** | 0.50*** | 0.52*** | 0.50*** | 0.10 | 0.40*** | -0.13 | 0.46*** | 0.55*** | 0.50*** | 0.51*** | 0.37*** | 0.29*** | -0.85 | 0.51*** | 0.53*** | 0.35*** | 0.66*** | 0.58*** | 0.51*** | 0.39*** | 0.48*** | 0.23** | 0.44*** | 0.58*** | 0.13 | ||||||||||||||||||||||||
| OTHER PROOF VODKA | 0.39*** | 0.42*** | 0.40*** | 0.45*** | 0.26*** | 0.35*** | 0.26*** | 0.24** | 0.06 | 0.42*** | 0.28*** | 0.02 | 0.38*** | 0.46*** | 0.15 | 0.38*** | 0.32*** | 0.29*** | 0.35*** | 0.40*** | 0.26*** | 0.06 | 0.12 | -0.14 | 0.14 | 0.47*** | 0.25** | 0.35*** | 0.22** | 0.50*** | NaNNA | 0.29*** | 0.16* | 0.36*** | 0.50*** | 0.29*** | 0.25** | 0.38*** | 0.31*** | 0.19* | 0.35*** | 0.29*** | 0.12 | 0.34*** | |||||||||||||||||||||||
| PEACH BRANDIES | -0.28*** | -0.34*** | -0.09 | -0.38*** | -0.36*** | -0.20** | -0.11 | -0.20** | -0.13 | -0.31*** | -0.24*** | 0.04 | -0.24*** | -0.36*** | -0.17* | -0.27*** | -0.32*** | -0.25*** | -0.26*** | -0.29*** | -0.21** | 0.01 | -0.18* | 0.08 | -0.25*** | -0.28*** | -0.34*** | -0.26*** | -0.28*** | -0.16* | -0.35 | -0.27*** | -0.32*** | -0.17* | -0.39*** | -0.32*** | -0.26*** | -0.25*** | -0.30*** | -0.32*** | -0.23** | -0.35*** | -0.02 | -0.24*** | -0.15 | ||||||||||||||||||||||
| PEACH SCHNAPPS | 0.57*** | 0.77*** | 0.29*** | 0.70*** | 0.67*** | 0.47*** | 0.51*** | 0.49*** | 0.33*** | 0.83*** | 0.42*** | 0.01 | 0.43*** | 0.79*** | 0.14 | 0.74*** | 0.56*** | 0.46*** | 0.68*** | 0.63*** | 0.46*** | 0.09 | 0.52*** | -0.16* | 0.50*** | 0.61*** | 0.72*** | 0.64*** | 0.61*** | 0.39*** | -0.75 | 0.59*** | 0.65*** | 0.48*** | 0.78*** | 0.68*** | 0.61*** | 0.44*** | 0.60*** | 0.36*** | 0.53*** | 0.75*** | 0.20** | 0.55*** | 0.37*** | -0.35*** | |||||||||||||||||||||
| PEPPERMINT SCHNAPPS | 0.57*** | 0.77*** | 0.49*** | 0.82*** | 0.65*** | 0.22** | 0.26*** | 0.52*** | 0.34*** | 0.61*** | 0.77*** | -0.02 | 0.74*** | 0.75*** | 0.15* | 0.71*** | 0.76*** | 0.59*** | 0.76*** | 0.75*** | 0.69*** | 0.26* | 0.59*** | -0.07 | 0.38*** | 0.51*** | 0.61*** | 0.49*** | 0.15* | 0.47*** | -0.56 | 0.59*** | 0.42*** | 0.21** | 0.73*** | 0.52*** | 0.35*** | 0.31*** | 0.46*** | 0.49*** | 0.46*** | 0.61*** | 0.15* | 0.52*** | 0.28*** | -0.31*** | 0.60*** | ||||||||||||||||||||
| PUERTO RICO & VIRGIN ISLANDS RUM | 0.65*** | 0.91*** | 0.40*** | 0.82*** | 0.73*** | 0.56*** | 0.58*** | 0.52*** | 0.35*** | 0.71*** | 0.53*** | 0.02 | 0.53*** | 0.88*** | 0.23** | 0.60*** | 0.81*** | 0.48*** | 0.63*** | 0.72*** | 0.58*** | 0.15 | 0.55*** | -0.13 | 0.58*** | 0.69*** | 0.83*** | 0.74*** | 0.48*** | 0.42*** | -0.73 | 0.61*** | 0.76*** | 0.55*** | 0.86*** | 0.86*** | 0.70*** | 0.49*** | 0.60*** | 0.50*** | 0.55*** | 0.88*** | 0.08 | 0.61*** | 0.36*** | -0.38*** | 0.74*** | 0.72*** | |||||||||||||||||||
| RASPBERRY SCHNAPPS | 0.44*** | 0.63*** | 0.32*** | 0.64*** | 0.45*** | 0.38*** | 0.42*** | 0.44*** | 0.21* | 0.58*** | 0.37*** | 0.05 | 0.35*** | 0.66*** | 0.12 | 0.51*** | 0.45*** | 0.34*** | 0.48*** | 0.48*** | 0.44*** | 0.13 | 0.40*** | -0.15* | 0.40*** | 0.57*** | 0.56*** | 0.48*** | 0.42*** | 0.40*** | -0.80 | 0.41*** | 0.56*** | 0.44*** | 0.66*** | 0.61*** | 0.54*** | 0.39*** | 0.46*** | 0.29*** | 0.36*** | 0.58*** | 0.02 | 0.54*** | 0.38*** | -0.32*** | 0.60*** | 0.50*** | 0.62*** | ||||||||||||||||||
| ROCK & RYE | 0.32*** | 0.48*** | 0.25** | 0.45*** | 0.35*** | 0.26*** | 0.22** | 0.17* | 0.15 | 0.34*** | 0.41*** | -0.05 | 0.45*** | 0.41*** | 0.23** | 0.39*** | 0.45*** | 0.44*** | 0.37*** | 0.43*** | 0.29*** | -0.01 | 0.34*** | -0.08 | 0.23** | 0.29*** | 0.34*** | 0.26*** | 0.17* | 0.27*** | -0.05 | 0.26*** | 0.25*** | 0.21** | 0.39*** | 0.34*** | 0.22** | 0.24** | 0.33*** | 0.44*** | 0.41*** | 0.42*** | 0.13 | 0.27*** | 0.27*** | -0.25*** | 0.34*** | 0.46*** | 0.42*** | 0.19* | |||||||||||||||||
| ROOT BEER SCHNAPPS | 0.38*** | 0.62*** | 0.28*** | 0.60*** | 0.59*** | 0.25*** | 0.23** | 0.37*** | 0.22** | 0.54*** | 0.44*** | 0.01 | 0.45*** | 0.62*** | 0.06 | 0.48*** | 0.51*** | 0.41*** | 0.54*** | 0.54*** | 0.42*** | 0.07 | 0.41*** | -0.15* | 0.35*** | 0.43*** | 0.54*** | 0.46*** | 0.32*** | 0.30*** | 0.36 | 0.42*** | 0.48*** | 0.20** | 0.65*** | 0.49*** | 0.38*** | 0.40*** | 0.43*** | 0.38*** | 0.38*** | 0.57*** | -0.02 | 0.37*** | 0.22** | -0.34*** | 0.60*** | 0.58*** | 0.60*** | 0.47*** | 0.34*** | ||||||||||||||||
| SCOTCH WHISKIES | 0.54*** | 0.83*** | 0.37*** | 0.75*** | 0.70*** | 0.37*** | 0.40*** | 0.48*** | 0.39*** | 0.62*** | 0.57*** | -0.01 | 0.60*** | 0.79*** | 0.17* | 0.61*** | 0.75*** | 0.53*** | 0.61*** | 0.68*** | 0.59*** | 0.07 | 0.50*** | -0.05 | 0.47*** | 0.60*** | 0.73*** | 0.60*** | 0.36*** | 0.37*** | -0.74 | 0.63*** | 0.72*** | 0.39*** | 0.76*** | 0.71*** | 0.55*** | 0.43*** | 0.58*** | 0.58*** | 0.48*** | 0.80*** | 0.17* | 0.48*** | 0.26*** | -0.39*** | 0.68*** | 0.74*** | 0.85*** | 0.53*** | 0.37*** | 0.59*** | |||||||||||||||
| SINGLE BARREL BOURBON WHISKIES | 0.12 | 0.21** | 0.17* | 0.17* | 0.19** | 0.15* | 0.26*** | 0.08 | 0.09 | 0.28*** | 0.01 | -0.02 | 0.05 | 0.27*** | 0.19* | 0.21** | 0.20** | 0.04 | 0.14* | 0.12 | 0.12 | -0.04 | 0.15* | 0.07 | 0.22** | 0.23** | 0.20** | 0.19** | 0.25*** | 0.15* | 0.75 | 0.15* | 0.32*** | 0.34*** | 0.26*** | 0.32*** | 0.36*** | 0.08 | 0.21** | -0.03 | 0.07 | 0.34*** | 0.15* | 0.16* | 0.15 | -0.10 | 0.24*** | 0.12 | 0.25*** | 0.19** | 0.09 | 0.11 | 0.24*** | ||||||||||||||
| SINGLE MALT SCOTCH | 0.50*** | 0.70*** | 0.34*** | 0.60*** | 0.48*** | 0.46*** | 0.50*** | 0.33*** | 0.39*** | 0.67*** | 0.32*** | 0.03 | 0.33*** | 0.78*** | 0.14 | 0.58*** | 0.50*** | 0.32*** | 0.49*** | 0.57*** | 0.54*** | -0.11 | 0.53*** | -0.02 | 0.48*** | 0.53*** | 0.59*** | 0.58*** | 0.49*** | 0.30*** | -0.64 | 0.56*** | 0.67*** | 0.46*** | 0.75*** | 0.74*** | 0.65*** | 0.51*** | 0.59*** | 0.30*** | 0.50*** | 0.77*** | 0.18* | 0.59*** | 0.28*** | -0.27*** | 0.64*** | 0.52*** | 0.71*** | 0.50*** | 0.42*** | 0.51*** | 0.65*** | 0.25*** | |||||||||||||
| SPEARMINT SCHNAPPS | 0.15* | 0.35*** | 0.04 | 0.29*** | 0.18* | 0.32*** | 0.38*** | 0.09 | 0.23** | 0.35*** | 0.21** | -0.01 | 0.24** | 0.32*** | 0.16* | 0.35*** | 0.26*** | 0.23** | 0.27*** | 0.26*** | 0.29*** | -0.19 | 0.26*** | -0.18* | 0.23** | 0.33*** | 0.28*** | 0.29*** | 0.38*** | 0.23** | -0.86 | 0.15* | 0.26*** | 0.33*** | 0.35*** | 0.28*** | 0.32*** | 0.38*** | 0.30*** | 0.24** | 0.18* | 0.41*** | 0.02 | 0.18* | 0.35*** | -0.11 | 0.36*** | 0.24** | 0.30*** | 0.29*** | 0.17* | 0.30*** | 0.37*** | 0.11 | 0.28*** | ||||||||||||
| SPICED RUM | 0.64*** | 0.92*** | 0.44*** | 0.83*** | 0.73*** | 0.46*** | 0.52*** | 0.55*** | 0.38*** | 0.70*** | 0.66*** | 0.05 | 0.61*** | 0.87*** | 0.21** | 0.69*** | 0.79*** | 0.57*** | 0.75*** | 0.73*** | 0.69*** | 0.06 | 0.55*** | -0.12 | 0.53*** | 0.64*** | 0.77*** | 0.69*** | 0.37*** | 0.43*** | -0.82 | 0.63*** | 0.66*** | 0.46*** | 0.86*** | 0.75*** | 0.59*** | 0.48*** | 0.55*** | 0.46*** | 0.60*** | 0.75*** | 0.11 | 0.57*** | 0.38*** | -0.34*** | 0.71*** | 0.83*** | 0.86*** | 0.61*** | 0.38*** | 0.60*** | 0.81*** | 0.19** | 0.67*** | 0.31*** | |||||||||||
| STRAIGHT BOURBON WHISKIES | 0.61*** | 0.90*** | 0.45*** | 0.80*** | 0.64*** | 0.46*** | 0.51*** | 0.47*** | 0.48*** | 0.73*** | 0.58*** | 0.02 | 0.63*** | 0.88*** | 0.28*** | 0.70*** | 0.76*** | 0.61*** | 0.73*** | 0.73*** | 0.64*** | 0.14 | 0.60*** | -0.08 | 0.57*** | 0.60*** | 0.71*** | 0.62*** | 0.45*** | 0.47*** | -0.60 | 0.67*** | 0.68*** | 0.41*** | 0.85*** | 0.79*** | 0.60*** | 0.51*** | 0.58*** | 0.48*** | 0.67*** | 0.84*** | 0.23** | 0.59*** | 0.35*** | -0.36*** | 0.75*** | 0.76*** | 0.86*** | 0.58*** | 0.50*** | 0.62*** | 0.83*** | 0.25*** | 0.73*** | 0.29*** | 0.86*** | ||||||||||
| STRAIGHT RYE WHISKIES | 0.41*** | 0.62*** | 0.33*** | 0.56*** | 0.55*** | 0.35*** | 0.39*** | 0.25*** | 0.33*** | 0.63*** | 0.29*** | -0.05 | 0.29*** | 0.62*** | 0.11 | 0.49*** | 0.37*** | 0.25*** | 0.43*** | 0.47*** | 0.37*** | 0.00 | 0.45*** | -0.17* | 0.41*** | 0.44*** | 0.52*** | 0.57*** | 0.42*** | 0.32*** | 0.08 | 0.50*** | 0.48*** | 0.33*** | 0.61*** | 0.51*** | 0.51*** | 0.38*** | 0.44*** | 0.14 | 0.27*** | 0.61*** | 0.11 | 0.42*** | 0.28*** | -0.20** | 0.57*** | 0.56*** | 0.55*** | 0.43*** | 0.24*** | 0.39*** | 0.56*** | 0.21** | 0.54*** | 0.18* | 0.58*** | 0.58*** | |||||||||
| STRAWBERRY SCHNAPPS | 0.41*** | 0.62*** | 0.26*** | 0.58*** | 0.42*** | 0.47*** | 0.49*** | 0.23** | 0.28*** | 0.58*** | 0.22** | 0.03 | 0.24*** | 0.62*** | 0.15* | 0.43*** | 0.39*** | 0.28*** | 0.38*** | 0.52*** | 0.38*** | -0.08 | 0.42*** | -0.11 | 0.51*** | 0.45*** | 0.55*** | 0.58*** | 0.54*** | 0.23** | -0.86 | 0.46*** | 0.57*** | 0.44*** | 0.65*** | 0.63*** | 0.60*** | 0.51*** | 0.53*** | 0.27*** | 0.39*** | 0.68*** | 0.09 | 0.46*** | 0.32*** | -0.31*** | 0.57*** | 0.39*** | 0.62*** | 0.47*** | 0.24** | 0.43*** | 0.58*** | 0.24*** | 0.57*** | 0.42*** | 0.54*** | 0.57*** | 0.44*** | ||||||||
| TENNESSEE WHISKIES | 0.40*** | 0.75*** | 0.32*** | 0.61*** | 0.69*** | 0.36*** | 0.38*** | 0.42*** | 0.37*** | 0.57*** | 0.41*** | 0.08 | 0.43*** | 0.68*** | 0.09 | 0.44*** | 0.57*** | 0.43*** | 0.47*** | 0.55*** | 0.54*** | 0.01 | 0.46*** | -0.15* | 0.38*** | 0.44*** | 0.63*** | 0.59*** | 0.37*** | 0.31*** | -0.02 | 0.48*** | 0.53*** | 0.33*** | 0.67*** | 0.59*** | 0.48*** | 0.40*** | 0.43*** | 0.31*** | 0.48*** | 0.60*** | 0.12 | 0.45*** | 0.27*** | -0.28*** | 0.60*** | 0.63*** | 0.66*** | 0.34*** | 0.37*** | 0.54*** | 0.66*** | 0.11 | 0.56*** | 0.29*** | 0.75*** | 0.66*** | 0.53*** | 0.49*** | |||||||
| TEQUILA | 0.54*** | 0.80*** | 0.22** | 0.66*** | 0.70*** | 0.59*** | 0.66*** | 0.39*** | 0.42*** | 0.72*** | 0.24*** | 0.08 | 0.26*** | 0.80*** | 0.18* | 0.49*** | 0.56*** | 0.29*** | 0.43*** | 0.55*** | 0.42*** | -0.07 | 0.47*** | -0.12 | 0.52*** | 0.65*** | 0.75*** | 0.75*** | 0.68*** | 0.28*** | -0.37 | 0.51*** | 0.80*** | 0.65*** | 0.79*** | 0.84*** | 0.84*** | 0.46*** | 0.59*** | 0.31*** | 0.43*** | 0.87*** | 0.04 | 0.52*** | 0.36*** | -0.28*** | 0.73*** | 0.51*** | 0.83*** | 0.60*** | 0.32*** | 0.53*** | 0.73*** | 0.34*** | 0.71*** | 0.35*** | 0.72*** | 0.72*** | 0.57*** | 0.69*** | 0.61*** | ||||||
| TRIPLE SEC | 0.42*** | 0.61*** | 0.10 | 0.49*** | 0.46*** | 0.57*** | 0.60*** | 0.17* | 0.39*** | 0.59*** | -0.01 | 0.01 | 0.05 | 0.62*** | 0.17* | 0.38*** | 0.30*** | 0.07 | 0.26*** | 0.38*** | 0.25*** | -0.03 | 0.32*** | -0.12 | 0.53*** | 0.53*** | 0.59*** | 0.60*** | 0.73*** | 0.20** | -0.63 | 0.37*** | 0.76*** | 0.55*** | 0.65*** | 0.74*** | 0.82*** | 0.47*** | 0.55*** | 0.17* | 0.28*** | 0.80*** | -0.01 | 0.43*** | 0.22** | -0.31*** | 0.58*** | 0.27*** | 0.67*** | 0.49*** | 0.20** | 0.39*** | 0.57*** | 0.30*** | 0.60*** | 0.36*** | 0.49*** | 0.58*** | 0.45*** | 0.65*** | 0.42*** | 0.86*** | |||||
| TROPICAL FRUIT SCHNAPPS | 0.44*** | 0.55*** | 0.18* | 0.59*** | 0.63*** | 0.25*** | 0.27*** | 0.51*** | 0.15 | 0.44*** | 0.45*** | 0.04 | 0.45*** | 0.57*** | 0.10 | 0.36*** | 0.49*** | 0.37*** | 0.38*** | 0.35*** | 0.30*** | 0.01 | 0.27*** | -0.16* | 0.31*** | 0.54*** | 0.66*** | 0.45*** | 0.19** | 0.28*** | -0.32 | 0.36*** | 0.46*** | 0.29*** | 0.54*** | 0.42*** | 0.33*** | 0.24*** | 0.40*** | 0.36*** | 0.25*** | 0.45*** | 0.05 | 0.38*** | 0.29*** | -0.29*** | 0.53*** | 0.51*** | 0.59*** | 0.49*** | 0.18* | 0.46*** | 0.54*** | 0.04 | 0.34*** | 0.22** | 0.58*** | 0.50*** | 0.37*** | 0.36*** | 0.40*** | 0.44*** | 0.28*** | ||||
| WATERMELON SCHNAPPS | 0.42*** | 0.59*** | 0.17* | 0.53*** | 0.52*** | 0.50*** | 0.59*** | 0.29*** | 0.24** | 0.76*** | 0.19** | 0.02 | 0.18** | 0.68*** | 0.18* | 0.54*** | 0.30*** | 0.20** | 0.41*** | 0.44*** | 0.35*** | -0.04 | 0.36*** | -0.18* | 0.46*** | 0.55*** | 0.59*** | 0.62*** | 0.77*** | 0.31*** | -0.71 | 0.44*** | 0.62*** | 0.55*** | 0.71*** | 0.62*** | 0.68*** | 0.47*** | 0.50*** | 0.19** | 0.36*** | 0.67*** | 0.12 | 0.52*** | 0.36*** | -0.34*** | 0.77*** | 0.37*** | 0.61*** | 0.58*** | 0.18* | 0.46*** | 0.49*** | 0.24*** | 0.55*** | 0.37*** | 0.55*** | 0.57*** | 0.50*** | 0.60*** | 0.48*** | 0.72*** | 0.69*** | 0.43*** | |||
| WHISKEY LIQUEUR | 0.54*** | 0.88*** | 0.36*** | 0.72*** | 0.64*** | 0.51*** | 0.56*** | 0.44*** | 0.47*** | 0.69*** | 0.46*** | 0.13 | 0.51*** | 0.83*** | 0.17* | 0.65*** | 0.69*** | 0.48*** | 0.61*** | 0.70*** | 0.64*** | 0.05 | 0.51*** | -0.09 | 0.56*** | 0.59*** | 0.70*** | 0.68*** | 0.52*** | 0.42*** | -0.51 | 0.59*** | 0.62*** | 0.47*** | 0.84*** | 0.78*** | 0.65*** | 0.55*** | 0.61*** | 0.35*** | 0.66*** | 0.81*** | 0.23** | 0.57*** | 0.43*** | -0.31*** | 0.73*** | 0.66*** | 0.80*** | 0.55*** | 0.39*** | 0.61*** | 0.72*** | 0.22** | 0.71*** | 0.36*** | 0.82*** | 0.83*** | 0.58*** | 0.60*** | 0.75*** | 0.77*** | 0.61*** | 0.44*** | 0.61*** | ||
| WHITE CREME DE CACAO | 0.30*** | 0.50*** | 0.14 | 0.32*** | 0.22** | 0.27*** | 0.31*** | 0.23** | 0.27*** | 0.37*** | 0.29*** | -0.01 | 0.30*** | 0.44*** | 0.27*** | 0.43*** | 0.35*** | 0.53*** | 0.46*** | 0.37*** | 0.27*** | -0.03 | 0.39*** | -0.14* | 0.31*** | 0.26*** | 0.23** | 0.25*** | 0.28*** | 0.30*** | 0.70 | 0.32*** | 0.21** | 0.25*** | 0.43*** | 0.32*** | 0.27*** | 0.27*** | 0.31*** | 0.20** | 0.53*** | 0.42*** | 0.27*** | 0.28*** | 0.36*** | -0.14 | 0.43*** | 0.32*** | 0.35*** | 0.31*** | 0.43*** | 0.31*** | 0.30*** | 0.16* | 0.40*** | 0.15* | 0.39*** | 0.60*** | 0.33*** | 0.25*** | 0.26*** | 0.33*** | 0.21** | 0.13 | 0.28*** | 0.55*** | |
| WHITE CREME DE MENTHE | 0.29*** | 0.36*** | 0.15* | 0.39*** | 0.25*** | 0.26*** | 0.32*** | 0.27*** | 0.08 | 0.31*** | 0.29*** | 0.08 | 0.31*** | 0.37*** | 0.35*** | 0.29*** | 0.27*** | 0.19* | 0.33*** | 0.35*** | 0.25*** | 0.14 | 0.29*** | -0.10 | 0.15 | 0.18* | 0.29*** | 0.39*** | 0.20** | 0.41*** | NaNNA | 0.24** | 0.15* | 0.30*** | 0.33*** | 0.27*** | 0.22** | 0.22** | 0.33*** | 0.25*** | 0.26*** | 0.26*** | 0.10 | 0.23** | 0.15 | -0.14 | 0.32*** | 0.30*** | 0.34*** | 0.28*** | 0.19* | 0.27*** | 0.20** | 0.13 | 0.29*** | 0.09 | 0.35*** | 0.36*** | 0.17* | 0.26*** | 0.18* | 0.24** | 0.08 | 0.18* | 0.19* | 0.36*** | 0.41*** |
corr.cat.ts <- rcorr(as.matrix(gr_date_cat.spread[,c(3:70)]))$r
corr.cat.ts <- data.frame(corr.cat.ts)
corr.cat.ts.dec <- apply(corr.cat.ts, 2, sort, decreasing=T)
corr.cat.ts.inc <- apply(corr.cat.ts,2, sort)
corr.cat.ts.min <- lapply(corr.cat.ts.inc, function(x) x[1:3])
corr.cat.ts.max <- lapply(corr.cat.ts.dec, function(x) x[2:4])
In the following lists we display the highest correlations (positive and negative) per category.
#Maximum positive correlations per category
sapply(corr.cat.ts.max, function(x) names(x))
## X100.PROOF.VODKA X80.PROOF.VODKA
## [1,] "80 PROOF VODKA" "SPICED RUM"
## [2,] "PUERTO RICO & VIRGIN ISLANDS RUM" "PUERTO RICO & VIRGIN ISLANDS RUM"
## [3,] "AMERICAN AMARETTO" "STRAIGHT BOURBON WHISKIES"
## AMERICAN.ALCOHOL AMERICAN.AMARETTO
## [1,] "COFFEE LIQUEURS" "BLENDED WHISKIES"
## [2,] "CREAM LIQUEURS" "IMPORTED SCHNAPPS"
## [3,] "AMERICAN AMARETTO" "SPICED RUM"
## AMERICAN.COCKTAILS AMERICAN.DRY.GINS
## [1,] "FLAVORED RUM" "AMERICAN GRAPE BRANDIES"
## [2,] "80 PROOF VODKA" "IMPORTED GRAPE BRANDIES"
## [3,] "PUERTO RICO & VIRGIN ISLANDS RUM" "FLAVORED VODKA"
## AMERICAN.GRAPE.BRANDIES AMERICAN.SLOE.GINS
## [1,] "IMPORTED GRAPE BRANDIES" "BLENDED WHISKIES"
## [2,] "AMERICAN DRY GINS" "SPICED RUM"
## [3,] "GRAPE SCHNAPPS" "PEPPERMINT SCHNAPPS"
## ANISETTE APPLE.SCHNAPPS
## [1,] "STRAIGHT BOURBON WHISKIES" "PEACH SCHNAPPS"
## [2,] "MISC. IMPORTED CORDIALS & LIQUEURS" "BLENDED WHISKIES"
## [3,] "WHISKEY LIQUEUR" "IMPORTED SCHNAPPS"
## APRICOT.BRANDIES BARBADOS.RUM BLACKBERRY.BRANDIES
## [1,] "BLACKBERRY BRANDIES" "HIGH PROOF BEER" "APRICOT BRANDIES"
## [2,] "PEPPERMINT SCHNAPPS" "WHISKEY LIQUEUR" "PEPPERMINT SCHNAPPS"
## [3,] "CANADIAN WHISKIES" "JAMAICA RUM" "CANADIAN WHISKIES"
## BLENDED.WHISKIES BOTTLED.IN.BOND.BOURBON
## [1,] "IMPORTED SCHNAPPS" "AMERICAN GRAPE BRANDIES"
## [2,] "80 PROOF VODKA" "IMPORTED GRAPE BRANDIES"
## [3,] "PUERTO RICO & VIRGIN ISLANDS RUM" "AMERICAN DRY GINS"
## BUTTERSCOTCH.SCHNAPPS CANADIAN.WHISKIES
## [1,] "CINNAMON SCHNAPPS" "PUERTO RICO & VIRGIN ISLANDS RUM"
## [2,] "PEACH SCHNAPPS" "80 PROOF VODKA"
## [3,] "APPLE SCHNAPPS" "SPICED RUM"
## CHERRY.BRANDIES CINNAMON.SCHNAPPS
## [1,] "CANADIAN WHISKIES" "BUTTERSCOTCH SCHNAPPS"
## [2,] "STRAIGHT BOURBON WHISKIES" "PEPPERMINT SCHNAPPS"
## [3,] "BLACKBERRY BRANDIES" "SPICED RUM"
## COFFEE.LIQUEURS CREAM.LIQUEURS CREME.DE.ALMOND
## [1,] "PEPPERMINT SCHNAPPS" "COFFEE LIQUEURS" "GREEN CREME DE MENTHE"
## [2,] "AMERICAN AMARETTO" "PEPPERMINT SCHNAPPS" "BLACKBERRY BRANDIES"
## [3,] "IMPORTED SCHNAPPS" "SPICED RUM" "AMERICAN AMARETTO"
## DARK.CREME.DE.CACAO DECANTERS...SPECIALTY.PACKAGES
## [1,] "STRAIGHT BOURBON WHISKIES" "MISCELLANEOUS BRANDIES"
## [2,] "PEPPERMINT SCHNAPPS" "CREAM LIQUEURS"
## [3,] "MISC. IMPORTED CORDIALS & LIQUEURS" "AMERICAN ALCOHOL"
## DISTILLED.SPIRITS.SPECIALTY
## [1,] "MISC. IMPORTED CORDIALS & LIQUEURS"
## [2,] "80 PROOF VODKA"
## [3,] "PUERTO RICO & VIRGIN ISLANDS RUM"
## FLAVORED.GIN FLAVORED.RUM
## [1,] "BLENDED WHISKIES" "PUERTO RICO & VIRGIN ISLANDS RUM"
## [2,] "IMPORTED SCHNAPPS" "AMERICAN COCKTAILS"
## [3,] "PUERTO RICO & VIRGIN ISLANDS RUM" "80 PROOF VODKA"
## FLAVORED.VODKA GRAPE.SCHNAPPS
## [1,] "80 PROOF VODKA" "WATERMELON SCHNAPPS"
## [2,] "TEQUILA" "TRIPLE SEC"
## [3,] "PUERTO RICO & VIRGIN ISLANDS RUM" "AMERICAN GRAPE BRANDIES"
## GREEN.CREME.DE.MENTHE HIGH.PROOF.BEER
## [1,] "CREME DE ALMOND" "SINGLE BARREL BOURBON WHISKIES"
## [2,] "BLACKBERRY BRANDIES" "WHITE CREME DE CACAO"
## [3,] "AMERICAN AMARETTO" "DISTILLED SPIRITS SPECIALTY"
## IMPORTED.AMARETTO IMPORTED.DRY.GINS
## [1,] "IMPORTED SCHNAPPS" "IMPORTED VODKA"
## [2,] "STRAIGHT BOURBON WHISKIES" "TEQUILA"
## [3,] "BLENDED WHISKIES" "MISC. IMPORTED CORDIALS & LIQUEURS"
## IMPORTED.GRAPE.BRANDIES IMPORTED.SCHNAPPS
## [1,] "AMERICAN GRAPE BRANDIES" "BLENDED WHISKIES"
## [2,] "IMPORTED VODKA - MISC" "80 PROOF VODKA"
## [3,] "AMERICAN DRY GINS" "SPICED RUM"
## IMPORTED.VODKA IMPORTED.VODKA...MISC
## [1,] "MISC. IMPORTED CORDIALS & LIQUEURS" "TEQUILA"
## [2,] "PUERTO RICO & VIRGIN ISLANDS RUM" "TRIPLE SEC"
## [3,] "TEQUILA" "IMPORTED VODKA"
## IRISH.WHISKIES JAMAICA.RUM
## [1,] "AMERICAN DRY GINS" "IMPORTED SCHNAPPS"
## [2,] "BLENDED WHISKIES" "MISC. IMPORTED CORDIALS & LIQUEURS"
## [3,] "IMPORTED SCHNAPPS" "BLENDED WHISKIES"
## LOW.PROOF.VODKA MISC..AMERICAN.CORDIALS...LIQUEURS
## [1,] "SCOTCH WHISKIES" "STRAIGHT BOURBON WHISKIES"
## [2,] "BLACKBERRY BRANDIES" "WHISKEY LIQUEUR"
## [3,] "CANADIAN WHISKIES" "80 PROOF VODKA"
## MISC..IMPORTED.CORDIALS...LIQUEURS
## [1,] "PUERTO RICO & VIRGIN ISLANDS RUM"
## [2,] "TEQUILA"
## [3,] "IMPORTED VODKA"
## MISCELLANEOUS.BRANDIES
## [1,] "MISC. AMERICAN CORDIALS & LIQUEURS"
## [2,] "WHITE CREME DE CACAO"
## [3,] "BUTTERSCOTCH SCHNAPPS"
## MISCELLANEOUS.SCHNAPPS OTHER.PROOF.VODKA
## [1,] "BLENDED WHISKIES" "GREEN CREME DE MENTHE"
## [2,] "IMPORTED SCHNAPPS" "IMPORTED SCHNAPPS"
## [3,] "PUERTO RICO & VIRGIN ISLANDS RUM" "FLAVORED GIN"
## PEACH.BRANDIES PEACH.SCHNAPPS
## [1,] "DECANTERS & SPECIALTY PACKAGES" "APPLE SCHNAPPS"
## [2,] "BARBADOS RUM" "BLENDED WHISKIES"
## [3,] "CREME DE ALMOND" "IMPORTED SCHNAPPS"
## PEPPERMINT.SCHNAPPS PUERTO.RICO...VIRGIN.ISLANDS.RUM
## [1,] "SPICED RUM" "80 PROOF VODKA"
## [2,] "AMERICAN AMARETTO" "BLENDED WHISKIES"
## [3,] "80 PROOF VODKA" "MISC. IMPORTED CORDIALS & LIQUEURS"
## RASPBERRY.SCHNAPPS ROCK...RYE ROOT.BEER.SCHNAPPS
## [1,] "IMPORTED SCHNAPPS" "STRAIGHT BOURBON WHISKIES" "IMPORTED SCHNAPPS"
## [2,] "BLENDED WHISKIES" "80 PROOF VODKA" "80 PROOF VODKA"
## [3,] "AMERICAN AMARETTO" "PEPPERMINT SCHNAPPS" "BLENDED WHISKIES"
## SCOTCH.WHISKIES
## [1,] "PUERTO RICO & VIRGIN ISLANDS RUM"
## [2,] "80 PROOF VODKA"
## [3,] "STRAIGHT BOURBON WHISKIES"
## SINGLE.BARREL.BOURBON.WHISKIES
## [1,] "HIGH PROOF BEER"
## [2,] "IMPORTED VODKA - MISC"
## [3,] "MISC. IMPORTED CORDIALS & LIQUEURS"
## SINGLE.MALT.SCOTCH
## [1,] "BLENDED WHISKIES"
## [2,] "MISC. IMPORTED CORDIALS & LIQUEURS"
## [3,] "IMPORTED SCHNAPPS"
## SPEARMINT.SCHNAPPS
## [1,] "STRAWBERRY SCHNAPPS"
## [2,] "MISC. IMPORTED CORDIALS & LIQUEURS"
## [3,] "IRISH WHISKIES"
## SPICED.RUM STRAIGHT.BOURBON.WHISKIES
## [1,] "80 PROOF VODKA" "80 PROOF VODKA"
## [2,] "BLENDED WHISKIES" "BLENDED WHISKIES"
## [3,] "PUERTO RICO & VIRGIN ISLANDS RUM" "PUERTO RICO & VIRGIN ISLANDS RUM"
## STRAIGHT.RYE.WHISKIES STRAWBERRY.SCHNAPPS
## [1,] "APPLE SCHNAPPS" "TEQUILA"
## [2,] "BLENDED WHISKIES" "MISC. IMPORTED CORDIALS & LIQUEURS"
## [3,] "80 PROOF VODKA" "TRIPLE SEC"
## TENNESSEE.WHISKIES TEQUILA
## [1,] "WHISKEY LIQUEUR" "MISC. IMPORTED CORDIALS & LIQUEURS"
## [2,] "80 PROOF VODKA" "TRIPLE SEC"
## [3,] "SPICED RUM" "IMPORTED VODKA - MISC"
## TRIPLE.SEC TROPICAL.FRUIT.SCHNAPPS
## [1,] "TEQUILA" "FLAVORED RUM"
## [2,] "IMPORTED VODKA - MISC" "AMERICAN COCKTAILS"
## [3,] "MISC. IMPORTED CORDIALS & LIQUEURS" "AMERICAN AMARETTO"
## WATERMELON.SCHNAPPS WHISKEY.LIQUEUR
## [1,] "GRAPE SCHNAPPS" "80 PROOF VODKA"
## [2,] "PEACH SCHNAPPS" "IMPORTED SCHNAPPS"
## [3,] "APPLE SCHNAPPS" "STRAIGHT BOURBON WHISKIES"
## WHITE.CREME.DE.CACAO WHITE.CREME.DE.MENTHE
## [1,] "HIGH PROOF BEER" "WHITE CREME DE CACAO"
## [2,] "STRAIGHT BOURBON WHISKIES" "GREEN CREME DE MENTHE"
## [3,] "WHISKEY LIQUEUR" "AMERICAN AMARETTO"
Looking at the results, I was again happy to see that the highest correlated liquor with Triple sec is Tequila (0.86) - Margarita cocktails are very popular in Iowa also it seems! The correlation of Triple sec with “Imported Vodka”" was notably high as well (0.82).
Looking at the top correlations for Tequila, we see cocktail mixers (cordials/liqueurs and triple sec) and the third one is also imported vodka.
A quick look at the negative correlations now.
#Maximum negative correlations per categor
sapply(corr.cat.ts.min, function(x) names(x))
## X100.PROOF.VODKA X80.PROOF.VODKA
## [1,] "HIGH PROOF BEER" "HIGH PROOF BEER"
## [2,] "PEACH BRANDIES" "PEACH BRANDIES"
## [3,] "DECANTERS & SPECIALTY PACKAGES" "DECANTERS & SPECIALTY PACKAGES"
## AMERICAN.ALCOHOL AMERICAN.AMARETTO
## [1,] "PEACH BRANDIES" "HIGH PROOF BEER"
## [2,] "BARBADOS RUM" "PEACH BRANDIES"
## [3,] "SPEARMINT SCHNAPPS" "DECANTERS & SPECIALTY PACKAGES"
## AMERICAN.COCKTAILS AMERICAN.DRY.GINS
## [1,] "PEACH BRANDIES" "PEACH BRANDIES"
## [2,] "DECANTERS & SPECIALTY PACKAGES" "DECANTERS & SPECIALTY PACKAGES"
## [3,] "CREME DE ALMOND" "CREME DE ALMOND"
## AMERICAN.GRAPE.BRANDIES AMERICAN.SLOE.GINS
## [1,] "DECANTERS & SPECIALTY PACKAGES" "HIGH PROOF BEER"
## [2,] "PEACH BRANDIES" "PEACH BRANDIES"
## [3,] "HIGH PROOF BEER" "DECANTERS & SPECIALTY PACKAGES"
## ANISETTE APPLE.SCHNAPPS
## [1,] "HIGH PROOF BEER" "HIGH PROOF BEER"
## [2,] "PEACH BRANDIES" "PEACH BRANDIES"
## [3,] "CREME DE ALMOND" "DECANTERS & SPECIALTY PACKAGES"
## APRICOT.BRANDIES BARBADOS.RUM
## [1,] "HIGH PROOF BEER" "CREME DE ALMOND"
## [2,] "PEACH BRANDIES" "DARK CREME DE CACAO"
## [3,] "DECANTERS & SPECIALTY PACKAGES" "BLACKBERRY BRANDIES"
## BLACKBERRY.BRANDIES BLENDED.WHISKIES
## [1,] "HIGH PROOF BEER" "HIGH PROOF BEER"
## [2,] "PEACH BRANDIES" "PEACH BRANDIES"
## [3,] "DECANTERS & SPECIALTY PACKAGES" "DECANTERS & SPECIALTY PACKAGES"
## BOTTLED.IN.BOND.BOURBON BUTTERSCOTCH.SCHNAPPS
## [1,] "CREME DE ALMOND" "HIGH PROOF BEER"
## [2,] "PEACH BRANDIES" "PEACH BRANDIES"
## [3,] "DECANTERS & SPECIALTY PACKAGES" "DECANTERS & SPECIALTY PACKAGES"
## CANADIAN.WHISKIES CHERRY.BRANDIES
## [1,] "HIGH PROOF BEER" "HIGH PROOF BEER"
## [2,] "PEACH BRANDIES" "PEACH BRANDIES"
## [3,] "DECANTERS & SPECIALTY PACKAGES" "DECANTERS & SPECIALTY PACKAGES"
## CINNAMON.SCHNAPPS COFFEE.LIQUEURS
## [1,] "HIGH PROOF BEER" "HIGH PROOF BEER"
## [2,] "PEACH BRANDIES" "PEACH BRANDIES"
## [3,] "DECANTERS & SPECIALTY PACKAGES" "DECANTERS & SPECIALTY PACKAGES"
## CREAM.LIQUEURS CREME.DE.ALMOND
## [1,] "HIGH PROOF BEER" "SPEARMINT SCHNAPPS"
## [2,] "PEACH BRANDIES" "DECANTERS & SPECIALTY PACKAGES"
## [3,] "BARBADOS RUM" "BOTTLED IN BOND BOURBON"
## DARK.CREME.DE.CACAO DECANTERS...SPECIALTY.PACKAGES
## [1,] "HIGH PROOF BEER" "DISTILLED SPIRITS SPECIALTY"
## [2,] "PEACH BRANDIES" "AMERICAN COCKTAILS"
## [3,] "DECANTERS & SPECIALTY PACKAGES" "CREME DE ALMOND"
## DISTILLED.SPIRITS.SPECIALTY FLAVORED.GIN
## [1,] "DECANTERS & SPECIALTY PACKAGES" "PEACH BRANDIES"
## [2,] "PEACH BRANDIES" "HIGH PROOF BEER"
## [3,] "BARBADOS RUM" "DECANTERS & SPECIALTY PACKAGES"
## FLAVORED.RUM FLAVORED.VODKA
## [1,] "HIGH PROOF BEER" "PEACH BRANDIES"
## [2,] "PEACH BRANDIES" "HIGH PROOF BEER"
## [3,] "DECANTERS & SPECIALTY PACKAGES" "DECANTERS & SPECIALTY PACKAGES"
## GRAPE.SCHNAPPS GREEN.CREME.DE.MENTHE
## [1,] "PEACH BRANDIES" "DECANTERS & SPECIALTY PACKAGES"
## [2,] "DECANTERS & SPECIALTY PACKAGES" "PEACH BRANDIES"
## [3,] "APRICOT BRANDIES" "BARBADOS RUM"
## HIGH.PROOF.BEER IMPORTED.AMARETTO
## [1,] "BUTTERSCOTCH SCHNAPPS" "HIGH PROOF BEER"
## [2,] "CREAM LIQUEURS" "PEACH BRANDIES"
## [3,] "CINNAMON SCHNAPPS" "DECANTERS & SPECIALTY PACKAGES"
## IMPORTED.DRY.GINS IMPORTED.GRAPE.BRANDIES
## [1,] "HIGH PROOF BEER" "PEACH BRANDIES"
## [2,] "PEACH BRANDIES" "DECANTERS & SPECIALTY PACKAGES"
## [3,] "CREME DE ALMOND" "CREME DE ALMOND"
## IMPORTED.SCHNAPPS IMPORTED.VODKA
## [1,] "HIGH PROOF BEER" "HIGH PROOF BEER"
## [2,] "PEACH BRANDIES" "PEACH BRANDIES"
## [3,] "DECANTERS & SPECIALTY PACKAGES" "CREME DE ALMOND"
## IMPORTED.VODKA...MISC IRISH.WHISKIES
## [1,] "PEACH BRANDIES" "HIGH PROOF BEER"
## [2,] "CREME DE ALMOND" "PEACH BRANDIES"
## [3,] "HIGH PROOF BEER" "DECANTERS & SPECIALTY PACKAGES"
## JAMAICA.RUM LOW.PROOF.VODKA
## [1,] "HIGH PROOF BEER" "PEACH BRANDIES"
## [2,] "PEACH BRANDIES" "HIGH PROOF BEER"
## [3,] "DECANTERS & SPECIALTY PACKAGES" "DECANTERS & SPECIALTY PACKAGES"
## MISC..AMERICAN.CORDIALS...LIQUEURS MISC..IMPORTED.CORDIALS...LIQUEURS
## [1,] "HIGH PROOF BEER" "HIGH PROOF BEER"
## [2,] "PEACH BRANDIES" "PEACH BRANDIES"
## [3,] "CREME DE ALMOND" "DECANTERS & SPECIALTY PACKAGES"
## MISCELLANEOUS.BRANDIES MISCELLANEOUS.SCHNAPPS
## [1,] "CREME DE ALMOND" "HIGH PROOF BEER"
## [2,] "IMPORTED DRY GINS" "PEACH BRANDIES"
## [3,] "ROOT BEER SCHNAPPS" "DECANTERS & SPECIALTY PACKAGES"
## OTHER.PROOF.VODKA PEACH.BRANDIES
## [1,] "PEACH BRANDIES" "IMPORTED SCHNAPPS"
## [2,] "DECANTERS & SPECIALTY PACKAGES" "SCOTCH WHISKIES"
## [3,] "BARBADOS RUM" "AMERICAN AMARETTO"
## PEACH.SCHNAPPS PEPPERMINT.SCHNAPPS
## [1,] "HIGH PROOF BEER" "HIGH PROOF BEER"
## [2,] "PEACH BRANDIES" "PEACH BRANDIES"
## [3,] "DECANTERS & SPECIALTY PACKAGES" "DECANTERS & SPECIALTY PACKAGES"
## PUERTO.RICO...VIRGIN.ISLANDS.RUM RASPBERRY.SCHNAPPS
## [1,] "HIGH PROOF BEER" "HIGH PROOF BEER"
## [2,] "PEACH BRANDIES" "PEACH BRANDIES"
## [3,] "DECANTERS & SPECIALTY PACKAGES" "DECANTERS & SPECIALTY PACKAGES"
## ROCK...RYE ROOT.BEER.SCHNAPPS
## [1,] "PEACH BRANDIES" "PEACH BRANDIES"
## [2,] "DECANTERS & SPECIALTY PACKAGES" "DECANTERS & SPECIALTY PACKAGES"
## [3,] "BARBADOS RUM" "MISCELLANEOUS BRANDIES"
## SCOTCH.WHISKIES SINGLE.BARREL.BOURBON.WHISKIES
## [1,] "HIGH PROOF BEER" "PEACH BRANDIES"
## [2,] "PEACH BRANDIES" "CREME DE ALMOND"
## [3,] "DECANTERS & SPECIALTY PACKAGES" "LOW PROOF VODKA"
## SINGLE.MALT.SCOTCH SPEARMINT.SCHNAPPS
## [1,] "HIGH PROOF BEER" "HIGH PROOF BEER"
## [2,] "PEACH BRANDIES" "CREME DE ALMOND"
## [3,] "CREME DE ALMOND" "DECANTERS & SPECIALTY PACKAGES"
## SPICED.RUM STRAIGHT.BOURBON.WHISKIES
## [1,] "HIGH PROOF BEER" "HIGH PROOF BEER"
## [2,] "PEACH BRANDIES" "PEACH BRANDIES"
## [3,] "DECANTERS & SPECIALTY PACKAGES" "DECANTERS & SPECIALTY PACKAGES"
## STRAIGHT.RYE.WHISKIES STRAWBERRY.SCHNAPPS
## [1,] "PEACH BRANDIES" "HIGH PROOF BEER"
## [2,] "DECANTERS & SPECIALTY PACKAGES" "PEACH BRANDIES"
## [3,] "BARBADOS RUM" "DECANTERS & SPECIALTY PACKAGES"
## TENNESSEE.WHISKIES TEQUILA
## [1,] "PEACH BRANDIES" "HIGH PROOF BEER"
## [2,] "DECANTERS & SPECIALTY PACKAGES" "PEACH BRANDIES"
## [3,] "HIGH PROOF BEER" "DECANTERS & SPECIALTY PACKAGES"
## TRIPLE.SEC TROPICAL.FRUIT.SCHNAPPS
## [1,] "HIGH PROOF BEER" "HIGH PROOF BEER"
## [2,] "PEACH BRANDIES" "PEACH BRANDIES"
## [3,] "DECANTERS & SPECIALTY PACKAGES" "DECANTERS & SPECIALTY PACKAGES"
## WATERMELON.SCHNAPPS WHISKEY.LIQUEUR
## [1,] "HIGH PROOF BEER" "HIGH PROOF BEER"
## [2,] "PEACH BRANDIES" "PEACH BRANDIES"
## [3,] "DECANTERS & SPECIALTY PACKAGES" "DECANTERS & SPECIALTY PACKAGES"
## WHITE.CREME.DE.CACAO WHITE.CREME.DE.MENTHE
## [1,] "DECANTERS & SPECIALTY PACKAGES" "PEACH BRANDIES"
## [2,] "PEACH BRANDIES" "DECANTERS & SPECIALTY PACKAGES"
## [3,] "CREME DE ALMOND" "BARBADOS RUM"
Unfortunately liquors which are outliers in terms of the very low quantities ordered, in generate correlate highly negatively with the other categories.
Looking at Tequila’s top negative correlations we see high proof beer, peach brandies and decanters & specialty packages.
It would be useful to plot tequila against its highest positive and negative correlations and see what we get
#Produce plots
test1 <- ggplot(data=gr_date_cat.spread[,c(64,44)],aes(x=`MISC. IMPORTED CORDIALS & LIQUEURS`, y=TEQUILA)) +
geom_point() +
geom_smooth(method=lm) +
geom_text(x = 1500, y = 10000,
label = lm_eqn(lm(TEQUILA~`MISC. IMPORTED CORDIALS & LIQUEURS`,gr_date_cat.spread[c(64,44)])),
parse = TRUE, size = 3)
test2 <- ggplot(data=gr_date_cat.spread[,c(64,65)],aes(x=`TRIPLE SEC`, y=TEQUILA)) +
geom_point() +
geom_smooth(method=lm) +
geom_text(x = 1500, y = 10000,
label = lm_eqn(lm(TEQUILA~`TRIPLE SEC`,gr_date_cat.spread[c(64,65)])),
parse = TRUE, size = 3)
test3 <- ggplot(data=gr_date_cat.spread[,c(64,39)],aes(x=`IMPORTED VODKA - MISC`, y=TEQUILA)) +
geom_point() +
geom_smooth(method=lm) +
geom_text(x = 1500, y = 10000,
label = lm_eqn(lm(TEQUILA~`IMPORTED VODKA - MISC`,gr_date_cat.spread[c(64,39)])),
parse = TRUE, size = 3)
test4 <- ggplot(data=gr_date_cat.spread[,c(64,33)],aes(x=`HIGH PROOF BEER`, y=TEQUILA)) +
geom_point() +
geom_smooth(method=lm, colour="red") +
geom_text(x = 5, y = 15000,
label = lm_eqn(lm(TEQUILA~`HIGH PROOF BEER`,gr_date_cat.spread[c(64,33)])),
parse = TRUE, size = 3)
test5 <- ggplot(data=gr_date_cat.spread[,c(64,48)],aes(x=`PEACH BRANDIES`, y=TEQUILA)) +
geom_point() +
geom_smooth(method=lm, colour="red") +
geom_text(x = 500, y = 10000,
label = lm_eqn(lm(TEQUILA~`PEACH BRANDIES`,gr_date_cat.spread[c(64,48)])),
parse = TRUE, size = 3)
test6 <- ggplot(data=gr_date_cat.spread[,c(64,26)],aes(x=`DECANTERS & SPECIALTY PACKAGES`, y=TEQUILA)) +
geom_point() +
geom_smooth(method=lm, colour="red") +
geom_text(x = 5000, y = 12000,
label = lm_eqn(lm(TEQUILA~`DECANTERS & SPECIALTY PACKAGES`,gr_date_cat.spread[c(64,26)])),
parse = TRUE, size = 3)
multiplot(test1, test2, test3, test4, test5, test6, cols=2)
Here we see that for the highest positive correlations there is indeed a clear linear trend. In the contrary, the highest negative correlations do not really display a strong linear trend.
Out of curiosity, let’s produce the relevant displays for 80 proof vodka.
test1 <- ggplot(data=gr_date_cat.spread[,c(4,59)],aes(x=`SPICED RUM`, y=`80 PROOF VODKA`)) +
geom_point() +
geom_smooth(method=lm) +
geom_text(x = 1500, y = 40000,
label = lm_eqn(lm(`80 PROOF VODKA`~`SPICED RUM`,gr_date_cat.spread[c(4,59)])),
parse = TRUE, size = 3)
test2 <- ggplot(data=gr_date_cat.spread[,c(4,51)],aes(x=`PUERTO RICO & VIRGIN ISLANDS RUM`, y=`80 PROOF VODKA`)) +
geom_point() +
geom_smooth(method=lm) +
geom_text(x = 1500, y = 40000,
label = lm_eqn(lm(`80 PROOF VODKA`~`PUERTO RICO & VIRGIN ISLANDS RUM`,gr_date_cat.spread[c(4,51)])),
parse = TRUE, size = 3)
test3 <- ggplot(data=gr_date_cat.spread[,c(4,60)],aes(x=`STRAIGHT BOURBON WHISKIES`, y=`80 PROOF VODKA`)) +
geom_point() +
geom_smooth(method=lm) +
geom_text(x = 1500, y = 40000,
label = lm_eqn(lm(`80 PROOF VODKA`~`STRAIGHT BOURBON WHISKIES`,gr_date_cat.spread[c(4,60)])),
parse = TRUE, size = 3)
test4 <- ggplot(data=gr_date_cat.spread[,c(4,33)],aes(x=`HIGH PROOF BEER`, y=`80 PROOF VODKA`)) +
geom_point() +
geom_smooth(method=lm, colour="red") +
geom_text(x = 5, y = 15000,
label = lm_eqn(lm(`80 PROOF VODKA`~`HIGH PROOF BEER`,gr_date_cat.spread[c(4,33)])),
parse = TRUE, size = 3)
test5 <- ggplot(data=gr_date_cat.spread[,c(4,48)],aes(x=`PEACH BRANDIES`, y=`80 PROOF VODKA`)) +
geom_point() +
geom_smooth(method=lm, colour="red") +
geom_text(x = 500, y = 10000,
label = lm_eqn(lm(`80 PROOF VODKA`~`PEACH BRANDIES`,gr_date_cat.spread[c(4,48)])),
parse = TRUE, size = 3)
test6 <- ggplot(data=gr_date_cat.spread[,c(4,26)],aes(x=`DECANTERS & SPECIALTY PACKAGES`, y=`80 PROOF VODKA`)) +
geom_point() +
geom_smooth(method=lm, colour="red") +
geom_text(x = 5000, y = 12000,
label = lm_eqn(lm(`80 PROOF VODKA`~`DECANTERS & SPECIALTY PACKAGES`,gr_date_cat.spread[c(4,26)])),
parse = TRUE, size = 3)
multiplot(test1, test2, test3, test4, test5, test6, cols=2)
Spiced rum, Puerto Rico rum and straight bourbon whiskies are the highest positive correlations here and one can see a clear linear trend with very high R-squared values. Coincidentally (?) the same three categories are the highest negative correlations. Whilst high proof beer is most definitely an outlier in terms of either data recording or actual frequency of orders, peach brandies and decanters appear probably uncorrelated altogether. To be fair there does appear to be some negative association for peach brandies, but if one looks at the highest negative correlations for all categories, these two appear extremely frequently - so they are genuine outliers in terms of order patterns.
The plot above shows an overlay of two time-series demonstrating measures of bottle quantities ordered. Specifically:
It is clear that when the quantities were population-adjusted, a clear periodicaal pattern is observable. Quantities of bottles ordered per citizen always peak on Tuesdays (annotated by the overlay of dashed lines). Monday and Wednesday are pretty much the same; after Wednesday the quantities decline. This pattern does makes sense: Monday is the first day of the week and the stores calculate their weekly requirements by the end of the day. They then submit the order either at end of business or early Tuesday; this allows time for the bottles to be delivered prior to the weekend. They shouldn’t need to order too many bottles after the early week’s budgeting.
This time-series of boxplots shows the monthly number of orders submitted by liquor stores; the colour of the boxplots shows the number of participating stores on that particular month.
It is evident that some data-recording issue takes place in June (perhaps relevant to fiscal years?), as orders reduce dramatically and do not recover later in the year.
This plot shows that it is best to separate the dataset into 2 periods (Jan-May, June-Dec) for a more robust time-series analysis.
This composite plot shows time series of bottle quantities ordered for all liquor categories in the data; also here the seasons are overlaid in order to aid the visual interpretation of the individual plots. There are many investigations one can partake in with these data, but at an elementary level the liquors can be categorised to “winter” and “summer” drinks.
We can identify that certain liquor types have a demand curve that only peaks in December, for example liqueurs with various flavours like crème de menthe/cacao/etc., brandies and some types of schnapps (cinnamon, butterscotch). Given that these types of liquors have a heavier taste and are generally considered “warming”, their demand curve makes sense - we have identified the “winter drinks”.
Another clear category of demand patterns is observable for a different set of liquors (tequila, triple sec, flavoured rums, imported vodka), where bottle quantities are positively correlated with expected average temperatures (demand lowest in winter, climbs gradually in spring, peaks between mid-spring/mid-summer, gradually descends in autumn). As opposed to the previous category, these are the “summer drinks”.
Due to time limitations this project as it was submitted is not thorough, in the sense that most insights are derived from superficial analyses. Most of my observations are intuitive, and I have not succeeded in revealing any hidden patterns within the time-frame (perhaps with the exception of the clear periodicity of the total population-adjusted bottle quantity orders) - I fell into many dead-end analysis attempts.
In general, I think that identifying the auxilliary datasets and merging them to the master took a longer time than I expected and this limited the amount of time I could dedicate to the actual analysis. However, I think that combining the Iowa Liquor Sales data with the auxiliaries has provided a large, granular and rich data set that enables anyone interested to play with the data and see what they find - I am sharing it on Dropbox.
This project gave me the opportunity to hone my skills in R data wrangling (packages data.table, dplyr are awesome) and data visualisation (ggplot2); not a lot was done in terms of statistical modelling though. I was also happy I got to use the stuff I learnt in a previous Udacity course in scraping tables and request data using API’s. Additionally, I started to use git and GitHub - the integration in RStudio is great.
A little disclaimer about the project.Rdata file I’ve added to the repo; I’ve dropped quite a few columns in order to reduce file-size and conform with GitHub rules - it allows faster knitting of the .rmd that would otherwise take ages. (For the full enriched dataset please see Dropbox link).